After that, it was down to attitude.

— Ian  Rankin,  Black  &  Blue. — 

The  purpose  of  this  book  is  to  provide  a  self-contained  entry  into  practical 
and  computational  Bayesian  statistics  using  generic  examples  from  the  most 
common models for a class duration of about seven blocks that roughly correspond
to 13-15 weeks of teaching (with three hours of lectures per week),
depending  on  the  intended  level  and  the  prerequisites  imposed  on  the  students. 
(That  estimate  does  not  include  practice — i.e.,  R  programming  labs,  writing 
data  reports — since  those  may  have  a  variable  duration,  also  depending  on 
the  students’  involvement  and  their  programming  abilities.)  The  emphasis  on 
practice  is  a  strong  commitment  of  this  book  in  that  its  primary  audience 
consists  of  graduate  students  who  need  to  use  (Bayesian)  statistics  as  a  tool 
to  analyze  their  experiments  and/or  datasets.  The  book  should  also  appeal 
to scientists in all fields who want to engage with Bayesian statistics, given
the versatility of the Bayesian tools. Bayesian Essentials can also be used for
a more classical statistics audience when aimed at teaching a quick entry to
Bayesian statistics at the end of an undergraduate program, for instance.
(Obviously, it can supplement another textbook on data analysis at the graduate
level.)

This book is an extensive revision of our previous book, Bayesian Core,
which  appeared  in  2007,  aiming  at  the  same  goals.  (Glancing  at  this  earlier 
version  will  show  the  filiation  to  most  readers.)  However,  after  publishing 
Bayesian  Core  and  teaching  from  it  to  different  audiences,  we  soon  realized 
that  the  level  of  mathematics  therein  was  actually  more  involved  than  the  one 
expected  by  those  audiences.  Students  were  also  asking  for  more  advice  and 
more R code than what was then available. We thus decided upon a major
revision,  producing  a  manual  that  cut  the  mathematics  and  expanded  the  R 
code,  changing  as  well  some  chapters  and  replacing  some  datasets.  We  had  at 
first  even  larger  ambitions  in  terms  of  contents,  but  had  eventually  to  sacrifice 
new  chapters  for  the  sake  of  completing  the  book  before  we  came  to  blows! 
To  stress  further  the  changes  from  the  2007  version,  we  also  decided  on  a  new 
title, Bayesian Essentials, that was actually suggested by Andrew Gelman
during  a  visit  to  Paris. 

The current format of the book is one of a quick coverage of the topics,
always backed by a motivated problem and a corresponding dataset (available
in the associated R package, bayess), and a detailed resolution of the
inference procedures pertaining to this problem, always including commented R
programs or relevant parts of R programs. Special attention is paid to the
derivation of prior distributions, and operational reference solutions are
proposed for each model under study. Additional cases are proposed as exercises.
The spirit is not unrelated to that of Nolan and Speed (2000), with more
emphasis on the methodological backgrounds. While the datasets are inspired by
real cases, we also cut down on their description and the motivations for their
analysis. The current format thus serves as a unique textbook for a service course
for scientists aimed at analyzing data the Bayesian way or as an introductory
course on Bayesian statistics.

Note  that  we  have  not  included  any  BUGS-oriented  hierarchical  analysis 
in  this  edition.  This  choice  is  deliberate:  We  have  instead  focussed  on  the 
Bayesian  processing  of  mostly  standard  statistical  models,  notably  in  terms 
of  prior  specification  and  of  the  stochastic  algorithms  that  are  required  to 
handle  Bayesian  estimation  and  model  choice  questions.  We  plainly  expect 
that  the  readers  of  our  book  will  have  no  difficulty  in  assimilating  the  BUGS 
philosophy,  relying,  for  instance,  on  the  highly  relevant  books  by  Lunn  et  al. 
(2012)  and  Gelman  et  al.  (2013). 

A  course  corresponding  to  the  book  has  now  been  taught  by  both  of  us 
for  several  years  in  a  second  year  master’s  program  for  students  aiming  at 
a professional degree in data processing and statistics (at Université Paris
Dauphine,  France)  as  well  as  in  several  US  and  Canadian  universities.  In  Paris 
Dauphine  the  first  half  of  the  book  was  used  in  a  6-week  (intensive)  program, 
and  students  were  tested  on  both  the  exercises  (meaning  all  exercises)  and 
their  (practical)  mastery  of  the  datasets,  the  stated  expectation  being  that 
they  should  go  beyond  a  mere  reproduction  of  the  R  outputs  presented  in 
the  book.  While  the  students  found  that  the  amount  of  work  required  by  this 
course  was  rather  beyond  their  usual  standards  (!),  we  observed  that  their 
understanding  and  mastery  of  Bayesian  techniques  were  much  deeper  and 
more  ingrained  than  in  the  more  formal  courses  their  counterparts  had  in  the 
years before. In short, they started to think about the purpose of a Bayesian
statistical analysis rather than about the contents of the final test and they ended
up building a true intuition about what the results should look like, intuition
that, for instance, helped them to detect modeling and programming errors!
In  most  subjects,  working  on  Bayesian  statistics  from  this  perspective  created 
a  genuine  interest  in  the  approach  and  several  students  continued  to  use  this 
approach  in  later  courses  or,  even  better,  on  the  job. 

Exercises  are  now  focussed  on  solving  problems  rather  than  addressing 
finer  theoretical  points.  Solutions  to  about  half  of  the  exercises  are  freely 
available on our webpages. We insist upon the point that the developments
contained in those exercises are often relevant for fully understanding the
chapter.

Thanks 

We are immensely grateful to colleagues and friends for their help with this
book and its previous version, Bayesian Core, in particular to the following
people: François Perron somehow started thinking about this book and
did  a  thorough  editing  of  it  during  a  second  visit  to  Dauphine,  helping  us 
to  adapt  it  more  closely  to  North  American  audiences.  He  also  adopted 
Bayesian  Core  as  a  textbook  in  Montreal  as  soon  as  it  appeared.  George 
Casella made helpful suggestions on the format of the book. Jérôme Dupuis
provided capture-recapture slides that have been recycled in Chap. 5. Arnaud
Doucet taught from the book at the University of British Columbia, Vancouver.
Jean-Dominique Lebreton provided the European dipper dataset of
Chap. 5. Gaëlle Lefol pointed out the Eurostoxx series as a versatile dataset
for Chap. 7. Kerrie Mengersen collaborated with both of us on a review paper
about mixtures that is related to Chap. 6. Jim Kay introduced us to the Lake
of Menteith dataset. Mike Titterington is thanked for collaborative friendship
over  the  years  and  for  a  detailed  set  of  comments  on  the  book  (quite  in  tune 
with his dedicated editorship of Biometrika). Jean-Louis Foulley provided us
with some datasets and with extensive comments on their Bayesian processing.
Even though we did not use those examples in the end, in connection
with  the  strategy  not  to  include  BUGS-oriented  materials,  we  are  indebted 
to  Jean-Louis  for  this  help.  Gilles  Celeux  carefully  read  the  manuscript  of 
the  first  edition  and  made  numerous  suggestions  on  both  content  and  style. 
Darren Wraith, Julyan Arbel, Marco Banterle, Robin Ryder, and Sophie Donnet
all reviewed some chapters or some R code and provided highly relevant
comments, which clearly contributed to the final output. The picture of the
caterpillar nest at the beginning of Chapter 3 was taken by Brigitte Plessis,
Christian P. Robert’s spouse, near his great-grandmother’s house in Brittany.

We are also grateful to the numerous readers who sent us queries about
potential typos, as there were indeed many typos, if not unclear statements.
Thanks in particular to Jarrett Barber and Hossein Gholami. We thus encourage
all new readers of Bayesian Essentials to do the same!

The  second  edition  of  Bayesian  Core  was  started,  thanks  to  the  support  of 
the Centre International de Rencontres Mathématiques (CIRM), sponsored
by both the Centre National de la Recherche Scientifique (CNRS) and the
Société Mathématique de France (SMF), located on the Luminy campus near
Marseille. Being able to work “in pair” in the center for 2 weeks was an
invaluable  opportunity,  boosted  by  the  lovely  surroundings  of  the  Calanques, 
where  mountain  and  sea  meet!  The  help  provided  by  the  CIRM  staff  during 
the  stay  is  also  most  gratefully  acknowledged. 

Montpellier,  France  Jean-Michel  Marin 

Paris,  France  Christian  P.  Robert 

September  19,  2013 

User’s Manual


The  bare  essentials,  in  other  words. 

— Ian  Rankin,  Tooth  &  Nail. — 


Roadmap 

The  Roadmap  is  a  section  that  will  start  each  chapter  by  providing  a  commented 
table  of  contents.  It  also  usually  contains  indications  on  the  purpose  of  the  chapter. 

For  instance,  in  this  initial  chapter,  we  explain  the  typographical  notations  that 
we  adopted  to  distinguish  between  the  different  semantic  levels  of  the  course. 
We  also  try  to  detail  how  one  should  work  with  this  book  and  how  one  could 
best  benefit  from  this  work.  This  chapter  is  to  be  understood  as  a  user’s  (or 
instructor’s)  manual  that  details  our  pedagogical  choices.  It  also  seems  the  right 
place  to  introduce  the  programming  language  R,  which  we  use  to  illustrate  all  the 
introduced  concepts. 

In  each  chapter,  both  Ian  Rankin’s  quotation  and  the  figure  on  top  of  the  title 
page  are  (at  best)  vaguely  related  to  the  topic  of  the  chapter,  and  one  should  not 
waste  too  much  time  pondering  their  implications  and  multiple  meanings.  The 
similarity  with  the  introductory  chapter  of  Introducing  Monte  Carlo  Methods  with 
R  is  not  coincidental,  as  Robert  and  Casella  (2009)  used  the  same  skeleton  as  in 
Bayesian  Core  and  as  we  restarted  from  their  version. 


J.-M. Marin and C.P. Robert, Bayesian Essentials with R, Springer Texts
in Statistics, DOI 10.1007/978-1-4614-8687-9_1,

©  Springer  Science+Business  Media  New  York  2014 


1.1  Expectations 

The key word associated with this book is modeling, that is, the ability to
build up a probabilistic interpretation of an observed phenomenon and the
“story” that goes with it. The “grand scheme” is to get anyone involved in
analyzing data to process a dataset within this coherent methodology. This
means picking a parameterized probability distribution, denoted by $f_\theta$, and
extracting information about (shortened to “estimating”) the unknown
parameter $\theta$ of this probability distribution in order to provide a convincing
interpretation of the reasons that led to the phenomenon at the basis of the
dataset (and/or to be able to draw predictions about upcoming phenomena
of the same nature). Before starting the description of the probability
distributions, we want to impress on the reader the essential feature that a model
is an interpretation of a real phenomenon that fits its characteristics up to
some degree of approximation rather than an explanation that would require
the model to be “true”. In short, there is no such thing as a “true model”,
even though some models are more appropriate than others!

In  this  book,  we  chose  to  describe  the  use  of  “classical”  probability  models 
for  several  reasons:  First,  it  is  often  better  to  start  a  trip  on  well-traveled 
paths  because  they  are  less  likely  to  give  rise  to  unexpected  surprises  and 
misinterpretations.  Second,  they  can  serve  as  references  for  more  advanced 
modelings:  Quantities  that  appear  in  both  simple  and  advanced  modelings 
should  get  comparable  estimators  or,  if  not,  the  more  advanced  modeling 
should account for that difference. Finally, the deliberate choice of an artificial
model should give a clearer meaning to the motto that all models are false
in that it illustrates the fact that a model is not necessarily justified by the
theory behind the modeled phenomenon but that its corresponding inference
can nonetheless be exploited as if it were a true model. By the end of the book,
the  reader  should  also  be  in  a  position  to  assess  the  relevance  of  a  particular 
model  for  a  given  dataset. 

Working  with  this  book  should  not  appear  as  a  major  endeavor:  The 
datasets are described along with the methods that are relevant for the
corresponding model, and the statistical analysis is provided with detailed
comments. The R code that backs up this analysis is included and commented
throughout  the  text.  If  there  is  a  difficulty  with  this  scheme,  it  actually  starts 
at  this  point:  Once  the  reader  has  seen  the  analysis,  it  should  be  possible 
for  her  or  him  to  repeat  this  analysis  or  a  similar  analysis  with  no  further 
assistance. Even better, the reader should try to read as little as possible of
the analysis proposed in this book and should instead try to
conduct the following stage of the analysis before reading the proposed (but
not unique) solution. The ultimate lesson here is that there are indeed many
ways  to  analyze  a  dataset  and  to  propose  modeling  scenarios  and  inferential 
schemes.  It  is  beyond  the  purpose  of  this  book  to  provide  all  of  those  analyses, 
and  the  reader  (or  the  instructor)  is  supposed  to  look  for  alternatives  on  her 
or  his  own. 

We thus expect readers to place themselves in a realistic situation and to
conduct this analysis as if in life-threatening (or job-threatening) situations. As
detailed in the preface, the course was originally intended for students in the
last year of study toward a professional degree, and it seems quite reasonable
to insist that they face similar situations before entering their upcoming job!


1.2  Prerequisites  and  Further  Reading 

This  being  a  textbook  about  statistical  modeling,  the  students  are  supposed  to 
have  a  background  in  both  probability  and  statistics,  at  the  level,  for  instance, 
of  Casella  and  Berger  (2001).  In  particular,  a  knowledge  of  standard  sampling 
distributions  and  their  properties  is  desirable.  Lab  work  in  the  spirit  of  Nolan 
and Speed (2000) is also a plus. (One should read in particular their
Appendix A on “How to write lab reports?”) Further knowledge about Bayesian
statistics  is  not  a  requirement,  although  using  Robert  (2007)  or  Hoff  (2009) 
as  further  references  would  bring  a  better  insight  into  the  topics  treated  here. 

Similarly, we expect students to be able to understand the bits of R
programs provided in the analysis, mostly because the syntax of R is very simple.
We  include  an  introduction  to  this  language  in  this  chapter  and  we  refer  to 
Dalgaard  (2002)  for  a  deeper  entry  and  also  to  Venables  and  Ripley  (2002). 

Besides  Robert  (2007),  the  philosophy  of  which  is  obviously  reflected  in  this 
book,  other  reference  books  pertaining  to  applied  Bayesian  statistics  include 
Gelman  et  al.  (2013),  Carlin  and  Louis  (1996),  and  Congdon  (2001,  2003). 
More  specific  books  that  cover  parts  of  the  topics  of  a  given  chapter  are 
mentioned  (with  moderation)  in  the  corresponding  chapter,  but  we  can  quote 
here  the  relevant  books  of  Holmes  et  al.  (2002),  Pole  et  al.  (1994),  and  Gill 
(2002).  We  want  to  stress  that  the  citations  are  limited  for  efficiency  purposes: 
There  is  no  extensive  coverage  of  the  literature  as  in,  e.g.,  Robert  (2007)  or 
Gelman  et  al.  (2013),  because  the  prime  purpose  of  the  book  is  to  provide 
a  working  methodology,  for  which  incremental  improvements  and  historical 
perspectives  are  not  directly  relevant. 

While we also cover simulation-based techniques in a self-contained
perspective, and thus do not assume prior knowledge of Monte Carlo methods,
detailed  references  are  Robert  and  Casella  (2004,  2009)  and  Chen  et  al.  (2000). 

Although we had at some stage intended to write a new chapter about
hierarchical Bayes analysis, we ended up not including this chapter in the
current edition, for several reasons. First, we were not completely
convinced about the relevance of a specific hierarchical chapter, given that
the hierarchical theme is somehow transversal to the book and pops up in the
mixture (Chap. 6), dynamic (Chap. 7) and image (Chap. 8) chapters. Second,
the revision had already taken too long and creating a brand new chapter did not
sound like a manageable goal. Third, managing realistic hierarchical models meant
relying on code written in JAGS and BUGS, which clashed with the
philosophy of basing the whole book on R code. This was subsumed by the recent
and highly relevant publication of The BUGS Book (Lunn et al., 2012) and by
the upcoming new edition of Bayesian Data Analysis (Gelman et al., 2013).

1.3 Styles and Fonts

Presentation often matters almost as much as content towards a better
understanding, and this is particularly true for data analyses, since they aim
to reproduce a realistic situation of a consultancy job where the consultant
must report to a customer the results of an analysis. A balanced use
of graphics, tables, itemized comments, and short paragraphs is, for instance,
quite important for providing an analysis that stresses the different conclusions
of the work, as well as the points that are yet unclear and those that could
be expanded.

In  particular,  because  this  book  is  doing  several  things  at  once  (that  is, 
to  introduce  theoretical  and  computational  concepts  and  to  implement  them 
in  realistic  situations),  it  needs  to  differentiate  between  the  purposes  and  the 
levels  of  the  parts  of  the  text  so  that  it  is  as  obvious  as  possible  to  the  reader. 
To this effect, we take advantage of the many possibilities of modern computer
editing, and in particular of LaTeX, as follows.

First,  a  minimal  amount  of  theoretical  bases  is  required  for  dealing  with 
the  model  introduced  in  each  chapter,  either  for  Bayesian  statistics  or  for 
Monte  Carlo  theory.  This  aspect  of  the  material  is  necessarily  part  of  the 
main  text,  but  it  is  also  kept  to  a  minimum — just  enough  for  the  book  to 
be  self-contained — and  therefore  occasional  references  to  more  detailed  books 
such  as  Robert  (2007)  and  Robert  and  Casella  (2004)  are  necessary.  These 
sections need to be well understood before handling the following applications
or  realistic  cases.  This  book  is  primarily  intended  for  those  without  a  strong 
background  in  the  theory  of  Bayesian  statistics  or  computational  methods, 
and  “theoretical”  sections  are  essential  for  them,  hence  the  need  to  keep  those 
sections  within  the  main  text. 


Statistics is as much about data processing as about mathematical
and probabilistic modeling. To enforce this principle, we center
each chapter around one or two specific realistic datasets that are
described early enough in the chapter to be used extensively throughout
the chapter. These datasets are available on the book’s Website
(http://www.ceremade.dauphine.fr/~xian/BCS/) and are part of the
corresponding R package bayess, as normaldata, capturedata, and so on, the
name being chosen in reference to the case/chapter heading. (Some of these
datasets are already available as datasets in the R language.) In particular,
we explain the “how and why” of the corresponding dataset in a separate
paragraph in this shaded format. This style is also used for illustrating
theoretical developments for the corresponding dataset and for specific computations
related to this dataset. For typographical convenience, large graphs and tables
may appear outside these sections, in subsequent pages, but are obviously
mentioned and identified within them.

Example 1.1. There may also be a need for detailed examples in addition
to the main datasets, although we strove to keep them to a minimum and
only  for  very  specific  issues  where  the  reference  dataset  was  not  appropriate. 
They  follow  this  numbered  style,  the  sideways  triangle  indicating  the  end  of 
the  example.  ◄ 


The last style used in the book is the warning, represented by a lightning
symbol in the margin: This entry is intended to signal major warnings about
things  that  can  (and  do)  go  wrong  “otherwise”;  that  is,  if  the  warning  is  not 
taken  into  account.  Needless  to  say,  these  paragraphs  must  be  given  the  utmost 
attention! 

A  diverse  collection  of  exercises  is  proposed  at  the  end  of  each  chapter,  with 
solutions to all those exercises freely available on the Springer-Verlag webpage.

1.4  An  Introduction  to  R 

This section attempts to introduce R to newcomers in a few pages and,
as such, it should not be considered as a proper introduction to R. Entire
volumes, such as the monumental R Book by Crawley (2007), and the
introduction by Dalgaard (2002), are dedicated to the practice of this language,
and therefore additional efforts (besides reading this chapter) will be required
from the reader to sufficiently master the language. However, before
discouraging anyone, let us comfort you with the fact that:

(a)  The  syntax  of  R  is  simple  and  logical  enough  to  quickly  allow  for  a  basic 
understanding  of  simple  R  programs,  as  should  become  obvious  in  a  few 
paragraphs. 

(b) The best, and in a sense the only, way to learn R is through trial-and-error
on simple and then more complex examples. Reading the book with
a  computer  available  nearby  is  therefore  the  best  way  of  implementing 
this  recommendation. 

In particular, the embedded help commands help() and help.search() are
very good starting points to gather information about a specific function or
a general issue, even though more detailed manuals are available both locally
and on-line. Note that help.start() opens a Web browser linked to the local
manual pages.
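
As a minimal sketch of how these help commands might be used in a first session (the topics "mean" and "normal distribution" are arbitrary choices made for illustration, and apropos() is a further base R search tool that the text does not mention):

```r
# List the objects whose names match a pattern; useful when only part of a
# function name is remembered. The pattern "^rnorm$" is an illustrative choice.
found <- apropos("^rnorm$")
print(found)

# The commands below open help viewers, so they are only run in an
# interactive session.
if (interactive()) {
  help("mean")                        # manual page of a specific function
  help.search("normal distribution")  # search the documentation for a topic
  help.start()                        # HTML browser interface to the manuals
}
```

Guarding the viewer-opening calls with interactive() is merely a convenience so that the same lines can also be sourced from a script.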

One may first wonder why we support using R as the programming
interface for this introduction to Monte Carlo methods, since there exist other


1 If you decide to skip this chapter, be sure to at least print the handy R
Reference Card available at http://cran.r-project.org/doc/contrib/Short-refcard.pdf that
summarizes, in four pages, the major commands of R.
languages, most (all?) of them faster than R, like Matlab, and some even free,
like  C  or  Python.  We  obviously  have  no  partisan  or  commercial  involvement 
in  this  language.2  Rather,  besides  the  ease  of  presentation,  our  main  reason 
for  this  choice  is  that  the  language  combines  a  sufficiently  high  power  (for  an 
interpreted  language)  with  a  very  clear  syntax  both  for  statistical  computation 
and  graphics.  R  is  a  flexible  language  that  is  object-oriented  and  thus  allows 
the  manipulation  of  complex  data  structures  in  a  condensed  and  efficient 
manner.  Its  graphical  abilities  are  also  remarkable.  R  provides  a  powerful 
interface  that  can  integrate  programs  written  in  other  languages  such  as  C, 
C++, Fortran, Perl, Python, and Java. Finally, it is increasingly common to see
people  who  develop  new  methodology  simultaneously  producing  an  R  package 
in  support  of  their  approach  and  to  back  up  introductory  statistics  courses 
with  illustrations  in  R. 

One  choice  we  have  not  addressed  above  is  “why  R  and  not  BUGS?”  BUGS 
(which  stands  for  Bayesian  inference  Using  Gibbs  Sampling)  is  a  Bayesian 
analysis  software  developed  since  the  early  1990s,  mostly  by  researchers  from 
the  Medical  Research  Council  (MRC)  at  Cambridge  University.  The  most 
common version is WinBUGS, working under Windows, but there also exists an
open-source version called OpenBUGS. So, to return to the initial question, we
are  not  addressing  the  possible  links  and  advantages  of  BUGS  simply  because 
the  purpose  is  different.  While  access  to  Monte  Carlo  specifications  is  possible 
in  BUGS,  most  computing  operations  are  handled  by  the  software  itself,  with 
the  possible  outcome  that  the  user  does  not  bother  about  this  side  of  the 
problem  and  instead  concentrates  on  Bayesian  modeling.  Thus,  while  R  can 
be  easily  linked  with  BUGS  and  simulation  can  be  done  via  BUGS,  we  think 
that  a  lower-level  language  such  as  R  is  more  effective  in  bringing  you  in 
touch.  However,  more  advanced  models  like  the  hierarchical  models  cannot 
be  easily  handled  by  basic  R  programming  and  packages  are  not  necessarily 
available  to  handle  the  variety  of  those  models  and  call  for  other  programming 
languages  like  JAGS.  (JAGS  standing  for  Just  Another  Gibbs  Sampler  and 
being  dedicated  to  the  study  of  Bayesian  hierarchical  models.  This  program 
is  also  freely  available  and  distributed  under  the  GNU  Licence,  the  current 
version  being  JAGS  3.3.0.) 

1.4.1  Getting  Started 

The  R  language  is  straightforward  to  install:  it  can  be  downloaded  (obviously 
free)  from  one  of  the  numerous  CRAN  (Comprehensive  R  Archive  Network) 
mirror  Websites  around  the  world.3 

At  this  stage,  we  refrain  from  covering  the  installation  of  the  R  package 
and  thus  assume  that  (a)  R  is  installed  on  the  machine  you  want  to  work  with 
and  (b)  that  you  have  managed  to  launch  it  (in  most  cases,  you  simply  have 

2 Once  again,  R  is  a  freely  distributed  and  open-source  language. 

3  The  main  CRAN  Website  is  http://cran.r-project.org/. 
to click on the proper icon). In the event you use a friendly (GUI) interface
like  RKWard,  the  interface  opens  several  windows  whose  use  should  be  self- 
explanatory  (along  with  a  proper  on-line  help).  Otherwise,  you  should  then 
obtain  a  terminal  window  whose  first  lines  resemble  the  following,  most  likely 
with  a  more  recent  version: 

R version 2.14.1 (2011-12-22)
Copyright (C) 2011 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: i686-pc-linux-gnu (32-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>

Neither  this  austere  beginning  nor  the  prospect  of  using  a  line  editor  should 
put  you  off,  though,  as  there  are  many  other  ways  of  inputting  and  outputting 
commands  and  data,  as  we  shall  soon  see!  The  final  line  above  with  the  symbol 
>  means  that  the  R  software  is  waiting  for  a  command  from  the  user.  This 
character  >  at  the  beginning  of  each  line  in  the  executable  window  is  called  the 
prompt  and  precedes  the  line  command,  which  is  terminated  by  pressing  the 
RETURN  key.  At  this  early  stage,  all  commands  will  be  passed  as  line  commands, 
and  you  should  thus  spot  commands  thanks  to  this  symbol. 

Commands  and  programs  that  need  to  be  stopped  during  their  execution, 
for  instance  because  they  take  too  long  or  too  much  memory  to  complete  or 
because  they  involve  a  programming  mistake  such  as  an  infinite  loop,  can  be 
stopped  by  the  Control-C  double-key  action  without  exiting  the  R  session. 

For  memory  and  efficiency  reasons,  R  does  not  install  all  the  available 
functions  and  programs  when  launched  but  only  the  basic  packages  that  it 
requires  to  run  properly.  Additional  packages  can  be  loaded  via  the  library 
command,  as  in 

> library(mnormt) # Multivariate Normal and t Distributions

and the entire list of available packages is provided by library(). (The symbol
# in the prompt line above indicates a comment: All characters following #
until the end of the command line are ignored. Comments are recommended to
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improve  the  readability  of  your  programs.)  There  exist  hundreds  of  packages 
available  on  the  Web.4  Installing  a  new  package  such  as  the  package  mnormt 
is  done  by  downloading  the  file  from  the  Web  depository  and  calling 

>  install .package ("mnormt") 

For  a  given  package,  the  install  .package  command  obviously  needs  to  be 
executed  only  once,  while  the  library  call  is  required  each  time  R  is  launched 
(as  the  corresponding  package  is  not  kept  as  part  of  the  .RData  file).  Thus,  it 
is  good  practice  to  include  calls  to  required  libraries  within  your  R  programs 
in  order  to  avoid  error  messages  when  launching  them. 
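As a minimal sketch of this practice, the lines below could open a program that relies on mnormt. The check through requireNamespace and the variable names pkg and loaded are illustrative additions, not part of the original text; the sketch only reports a missing package and does not install it.

```r
# Load a required package at the top of a program, failing gracefully if absent.
pkg = "mnormt"   # package name taken from the example above
if (requireNamespace(pkg, quietly=TRUE)) {
  library(pkg, character.only=TRUE)   # attach the package for this session
  loaded = TRUE
} else {
  # the package was never installed on this machine
  message("Package ", pkg, " is missing: run install.packages(\"", pkg, "\") once")
  loaded = FALSE
}
```

The rest of the program can then test loaded before calling functions from the package.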

1.4.2  R  Objects 

As  with  many  advanced  programming  languages,  R  distinguishes  between 
several  types  of  objects.  Those  types  include  scalar,  vector,  matrix,  time  series, 
data  frames,  functions,  or  graphics.  An  R  object  is  mostly  characterized  by  a 
mode  that  describes  its  contents  and  a  class  that  describes  its  structure.  The 
R  function  str  applied  to  any  R  object,  including  R  functions,  will  show  its 
structure.  For  instance, 

> str(log)
function (x, base = exp(1))

The different modes are

- null (empty object),
- logical (TRUE or FALSE),
- numeric (such as 3, 0.14159, or 2+sqrt(3)),
- complex (such as 3-2i or complex(1,4,-2)), and
- character (such as "Blue", "binomial", "male", or "y=a+bx"),

and the main classes are vector, matrix, array, factor, time-series,
data.frame, and list. Heterogeneous objects such as those of the list class
can include elements with various modes. Manual entries about those classes
can be obtained via the help commands help(data.frame) or ?matrix for
instance.
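A short sketch of how modes and classes can be inspected follows. The object names x, z, s, and li are arbitrary choices made for this illustration.

```r
# Inspect the mode (contents) and the structure of a few objects.
x = c(2, 6, -4)                     # numeric vector
z = complex(real=3, imaginary=-2)   # the complex number 3-2i
s = "y=a+bx"                        # character string
li = list(x, s, TRUE)               # heterogeneous list mixing three modes
mode(x); mode(z); mode(s); mode(li)
class(factor(c("male", "female")))  # the factor class
str(li)                             # compact display of the structure of li
```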

R  can  operate  on  most  of  those  types  as  a  regular  function  would  operate 
on  a  scalar,  which  is  a  feature  that  should  be  exploited  as  much  as  possible  for 
compact  and  efficient  programming.  The  fact  that  R  is  interpreted  rather  than 
compiled involves many subtle differences, but a major issue is that all variables
in the system are evaluated and stored at every step of R programs. This
means  that  loops  in  R  are  enormously  time-consuming  and  should  be  avoided 
at  all  costs!  Therefore,  using  the  shortcuts  offered  by  R  in  the  manipulation 
of  vectors,  matrices,  and  other  structures  is  a  must. 

⁴ Packages that have been validated and tested by the R core team are listed at
http://cran.r-project.org/src/contrib/PACKAGES.html.

The vector class

As indicated logically by its name, the vector object corresponds to a
mathematical vector of elements of the same type, such as (TRUE,TRUE,FALSE)
or (1,2,3,5,7,11). Creating small vectors can be done using the R command
c() as in

> a=c(2,6,-4,9,18)

This fundamental function combines or concatenates terms together. For
instance,

> d=c(a,b)

concatenates the two vectors a and b into a new vector d. Note that decimal
numbers should be encoded with a dot, character strings in quotes " ", and
logical values with the character strings TRUE and FALSE or with their
respective abbreviations T and F. Missing values are encoded with the character
string NA.

In Fig. 1.1, we give a few illustrations of the use of vectors in R. The
character + indicates that the console is waiting for a supplementary instruction,
which is useful when typing long expressions. The assignment operator is =,
not to be confused with ==, which is the Boolean operator for equality. An
older assignment operator is <-, as in

> x <- c(3,6,9)

and, at least for compatibility reasons, it still remains functional in current
versions of R, but we prefer using the equality sign. (As pointed out by Spector
(2009), an exception is when using system.time, briefly described in Fig. 1.8,
since = is then used to identify keywords, although = can preserve its initial
purpose if curly brackets { and } delimit the allocation commands.)

A misleading feature of the assignment operator <- is found in Boolean
expressions such as

> if (x[1]<-2) ...

which is supposed to test whether or not x[1] is less than -2 but ends up
allocating 2 to x[1], erasing its current value! Adding a space in the expression
is sufficient to solve the problem: if (x[1] < -2).

Note also that mistakenly using

> if (x[1]=-2) ...

instead of (x[1]==-2) does not perform the intended test either: R typically
rejects this line with a syntax error.
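The pitfall can be checked on a small vector, as sketched here. The vectors x and y and the flags are hypothetical names introduced for this illustration only.

```r
# The same characters with and without a space give two different commands.
x = c(-5, 0, 3)
if (x[1]<-2) flag = TRUE     # parsed as x[1] <- 2: x[1] is overwritten by 2
x[1]                         # no longer -5

y = c(-5, 0, 3)
if (y[1] < -2) flag2 = TRUE  # the intended comparison: y is left untouched
y[1]                         # still -5
```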

New R objects are simply defined by assigning them a value, as in the
first line of Fig. 1.1, without the preliminary declaration of type that is
required in the C language.

> a=c(5,5.6,1,4,-5)       build the object a containing a numeric vector
                          of dimension 5 with elements 5, 5.6, 1, 4, -5
> a[1]                    display the first element of a
> b=a[2:4]                build the numeric vector b of dimension 3
                          with elements 5.6, 1, 4
> d=a[c(1,3,5)]           build the numeric vector d of dimension 3
                          with elements 5, 1, -5
> 2*a                     multiply each element of a by 2
                          and display the result
> b%%3                    provides each element of b modulo 3
> d%/%2.4                 computes the integer division of each element of d by 2.4
> e=3/d                   build the numeric vector⁵ e of dimension 3
                          and elements 3/5, 3, -3/5
> log(d*e)                multiply the vectors d and e term by term
                          and transform each term into its natural logarithm
> sum(d)                  calculate the sum of d
> length(d)               display the length of d
> t(d)                    transpose d, the result is a row vector
> t(d)%*%e                scalar product between the row vector t(d) and
                          the column vector e with identical length
> t(d)*e                  element-wise product between two vectors
                          with identical lengths
> g=c(sqrt(2),log(10))    build the numeric vector g of dimension 2
                          and elements sqrt(2), log(10)
> e[d==5]                 build the subvector of e that contains the
                          components e[i] such that d[i]=5
> a[-3]                   create the subvector of a that contains
                          all components of a but the third.
> is.vector(d)            display the logical expression TRUE if
                          a vector and FALSE else

Fig. 1.1. Illustrations of the processing of vectors in R

⁵ The variable e is not predefined in R as exp(1).

⁶ Positive and negative indices cannot be used simultaneously.

Note⁶ in Fig. 1.1 the convenient use of Boolean expressions to
extract subvectors from a vector without having to resort to a
component-by-component test (and hence a loop). The quantity d==5 is itself a vector of
Booleans, while the number of components satisfying the constraint can be
computed by sum(d==5). The ability to apply scalar functions to vectors as a
whole is also a major advantage of R. In the event the function depends on a
parameter or an option, this quantity can be entered as in

> e=lgamma(e^2) #warning: this is not the exponential basis, exp(1)

which returns the vector with components log Γ(e_i^2). Functions that are
specially designed for vectors include, for instance, sample, order, sort and rank,
which all have to do with manipulating the order in which the components of
the vector occur.
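The following sketch replays some of these operations on the vectors of Fig. 1.1. It only uses base R functions, and the comments describe what each call returns rather than quoting printed output.

```r
# Boolean extraction and order-related functions on the vectors of Fig. 1.1.
a = c(5, 5.6, 1, 4, -5)
d = a[c(1, 3, 5)]   # elements 5, 1, -5
e = 3/d             # elements 3/5, 3, -3/5
e[d==5]             # components of e whose counterpart in d equals 5
sum(d==5)           # number of components of d equal to 5
sort(a)             # values of a in increasing order
order(a)            # permutation of the indices that sorts a
rank(a)             # rank of each component of a
lgamma(e^2)         # log-gamma function applied to each squared component
```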

Besides their numeric and logical indexes, the components of a vector can
also be identified by names. For a given vector x, names(x) is a vector of
characters of the same length as x. This additional attribute is most useful
when dealing with real data, where the components have a meaning such
as "unemployed" or "democrat". Those names can also be erased by the
command

> names(x)=NULL


The : operator found in Fig. 1.1 is a very useful device that defines a consecutive
sequence, but it is also fragile in that sequences do not always produce what is
expected. For instance, 1:2*n corresponds to (1:2)*n rather than 1:(2*n).
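This precedence rule can be verified directly, as in the sketch below, where the value n=3 is an arbitrary choice.

```r
# The colon operator is evaluated before multiplication and subtraction.
n = 3
s1 = 1:2*n      # same as (1:2)*n, that is, 3 6
s2 = 1:(2*n)    # the sequence 1 2 3 4 5 6
s3 = 1:n-1      # same as (1:n)-1, that is, 0 1 2
s4 = 1:(n-1)    # the sequence 1 2
```

Adding parentheses systematically around the bounds of a sequence avoids this kind of surprise.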

The  matrix,  array,  and  factor  classes 

The matrix class provides the R representation of matrices. A typical entry
is, for instance,

> x=matrix(vec,nrow=n,ncol=p)

which creates an n × p matrix whose elements are those of the vector vec,
assuming this vector is of dimension np. An important feature of this entry
is that, in a somewhat unusual way, the components of vec are stored by
column, which means that x[1,1] is equal to vec[1], x[2,1] is equal to
vec[2], and so on, except if the option byrow=T is used in matrix. (Because
of this choice of storage convention, working on R matrices column-wise is
faster than working row-wise.) Note also that, if vec is of dimension np, it
is not necessary to specify both the nrow=n and ncol=p options in matrix.
One of those two parameters is sufficient to define the matrix. On the other
hand, if vec is not of dimension np, matrix(vec,nrow=n,ncol=p) will
create an n × p matrix with the components of vec repeated the appropriate
number of times. For instance,

> matrix(1:4,ncol=3)
     [,1] [,2] [,3]
[1,]    1    3    1
[2,]    2    4    2
Warning message:
data length [4] is not a submultiple or multiple of the number
of columns [3] in matrix in: matrix(1:4, ncol = 3)

produces a 2 × 3 matrix along with a warning message that something may
be missing in the call to matrix. Note again that 1, 2, 3, 4 are entered
consecutively when following the column (or lexicographic) order. Names can be
given to the rows and columns of a matrix using the rownames and colnames
functions.
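As a small sketch of these conventions, the same six values can be stored by column or by row and then labeled. The row and column names used here are arbitrary.

```r
# Column-wise (default) versus row-wise filling, plus row and column names.
x = matrix(1:6, nrow=2)              # columns are (1,2), (3,4), (5,6)
y = matrix(1:6, nrow=2, byrow=TRUE)  # rows are (1,2,3) and (4,5,6)
rownames(x) = c("r1", "r2")
colnames(x) = c("a", "b", "c")
x["r2", "b"]                         # entry in row 2, column 2, equal to 4
```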

Note that, in some situations, it is useful to remember that an R matrix
can also be used as a vector. If x is an n × p matrix, x[i+n*(j-1)] is equal
to x[i,j], i.e., x can also be manipulated as a vector made of the columns
of x piled on top of one another. For instance, x[x>5] is a vector, while
x[x>5]=0 modifies the right entries in the matrix x. Conversely, vectors can
be turned into p × 1 matrices by the command as.matrix. Note that x[1,]
produces the first row of x as a vector rather than as a 1 × p matrix.
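The correspondence between the two indexing schemes can be checked on a small case, as sketched below. The dimensions and the threshold 15 are arbitrary, and the option drop=FALSE is shown as one way to keep a row in matrix form.

```r
# A matrix seen as a vector made of its stacked columns.
n = 5; p = 4
x = matrix(1:20, nrow=n)
i = 3; j = 2
same = (x[i+n*(j-1)] == x[i,j])   # both select the same entry
v = x[x>15]                       # a plain vector: 16, 17, 18, 19, 20
x[x>15] = 0                       # modifies those entries within the matrix
is.null(dim(x[1,]))               # TRUE: a single row loses its matrix shape
dim(x[1,,drop=FALSE])             # 1 4: the row is kept as a 1 x p matrix
```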

R allows for a wide range of manipulations on matrices, both termwise and
in the classical matrix algebra perspective. For instance, the standard matrix
product is denoted by %*%, while * represents the term-by-term product. (Note
that taking the product a%*%b when the number of columns of a differs from
the number of rows of b produces an error message.) Figure 1.2 gives a few
examples of matrix-related commands. The apply function is particularly easy
to use for functions operating on matrices by row or column.


> x1=matrix(1:20,nrow=5)            build the numeric matrix x1 of dimension
                                    5 × 4 with first row 1, 6, 11, 16
> x2=matrix(1:20,nrow=5,byrow=T)    build the numeric matrix x2 of dimension
                                    5 × 4 with first row 1, 2, 3, 4
> a=x1+x2                           matrix summation of x1 and x2
> x3=t(x2)                          transpose the matrix x2
> b=x3%*%x2                         matrix product between x3 and x2,
                                    with a check of the dimension compatibility
> c=x1*x2                           term-by-term product between x1 and x2
> dim(x1)                           display the dimensions of x1
> b[,2]                             select the second column of b
> b[c(3,4),]                        select the third and fourth rows of b
> b[-2,]                            delete the second row of b
> rbind(x1,x2)                      vertical merging of x1 and x2
> cbind(x1,x2)                      horizontal merging of x1 and x2
> apply(x1,1,sum)                   calculate the sum of each row of x1
> as.matrix(1:10)                   turn the vector 1:10 into a 10 × 1 matrix

Fig. 1.2. Illustrations of the processing of matrices in R


The function diag can be used to extract the vector of the diagonal
elements of a matrix, as in diag(a), or to create a diagonal matrix with a
given diagonal, as in diag(1:10). Since matrix algebra is central to good
programming in R, as matrix programming allows for the elimination of
time-consuming loops, it is important to be familiar with matrix manipulation. For
instance, the function crossprod replaces the product t(x)%*%y on either
vectors or matrices by crossprod(x,y) more efficiently:

> system.time(crossprod(1:10^6,1:10^6))
   user  system elapsed
  0.016   0.048   0.066

> system.time(t(1:10^6)%*%(1:10^6))
   user  system elapsed
  0.084   0.036   0.121

Eigenanalysis of square matrices is also included in the base package. For
instance, chol(m) returns the upper triangular factor of the Choleski
decomposition of m; that is, the matrix R such that R^T R is equal to m. Similarly,
eigen(m) returns a list that contains the eigenvalues of m (some of which can
be complex numbers) as well as the corresponding eigenvectors (some of which
are complex if there are complex eigenvalues). Related functions are svd and
qr, which provide the singular values and the QR decomposition of their
argument, respectively. Note that the inverse M^{-1} of a matrix M can be found
either by solve(M) (recommended) or ginv(M), which requires loading
the library MASS and also produces generalized inverses (which may be a
mixed blessing since the fact that a matrix is not invertible is not signaled by
ginv). Special versions of solve are backsolve and forwardsolve, which are
restricted to upper and lower triangular systems, respectively. Note
also the alternative of using chol2inv, which returns the inverse of a matrix
m when provided with the Choleski decomposition chol(m).
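These decompositions can be checked numerically on a small symmetric positive definite matrix, as in the following sketch. The matrix m is the one used in Fig. 1.4, and the names R, minv, err1, err2, and ev are introduced for this illustration only.

```r
# Choleski factor, inverse, and eigenvalues of a symmetric positive definite matrix.
m = matrix(c(6,2,0,2,6,0,0,0,36), nrow=3)
R = chol(m)                          # upper triangular factor
err1 = max(abs(t(R)%*%R-m))          # close to 0: t(R) %*% R recovers m
minv = solve(m)                      # inverse of m
err2 = max(abs(chol2inv(R)-minv))    # close to 0: same inverse from the factor
ev = eigen(m, symmetric=TRUE)
ev$values                            # eigenvalues in decreasing order: 36, 8, 4
```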

Structures with more than two indices are represented by arrays and can
also be processed by R commands, for instance x=array(1:50,c(2,5,5)),
which gives a three-entry table of 50 terms. Once again, they can also be
interpreted as vectors.

The apply function used in Fig. 1.2 is a very powerful device that
operates on arrays and, in particular, matrices. Since it can return arrays, it
bypasses calls to multiple loops and makes for (sometimes) quicker and
(always) cleaner programs. It should not be considered as a panacea, however,
as apply hides calls to loops inside a single command. For instance, a
comparison of apply(A,1,mean) with rowMeans(A) shows the second version is
about 200 times faster. Using linear algebra whenever possible is therefore a
more efficient solution. Spector (2009, Sect. 8.7) gives a detailed analysis of
the limitations of apply and the advantages of vectorization in R.
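A comparison of this kind can be run as sketched below. The size of the matrix A and the number of repetitions are arbitrary choices, and the timings depend on the machine, so no figures are quoted here.

```r
# Row means computed through apply and through the dedicated rowMeans function.
A = matrix(rnorm(2000*50), nrow=2000)
m1 = apply(A, 1, mean)   # hidden loop over the 2000 rows
m2 = rowMeans(A)         # vectorized version
all.equal(m1, m2)        # both versions agree up to rounding
system.time(for (k in 1:50) apply(A, 1, mean))[3]   # elapsed time for apply
system.time(for (k in 1:50) rowMeans(A))[3]         # elapsed time for rowMeans
```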

A factor is a vector of characters or integers used to specify a discrete
classification of the components of other vectors with the same length. Its
main difference from a standard vector is that it comes with a levels attribute
used to specify the possible values of the factor. This structure is therefore
appropriate to represent qualitative variables. R provides both ordered and
unordered factors, whose major appeal lies within model formulas, as
illustrated in Fig. 1.3. Note the subtle difference between apply and tapply.


> state=c("tas","tas","sa","sa","wa")    create a vector with five values
> statef=factor(state)                   distinguish entries by group
> levels(statef)                         give the groups
> incomes=c(60,59,40,42,23)              create a vector of incomes
> tapply(incomes,statef,mean)            average the incomes for each group
> statef=factor(state,                   define a new level with one more
+ levels=c("tas","sa","wa","yo"))        group than observed
> table(statef)                          return statistics for all levels

Fig. 1.3. Illustrations of the factor class


The list and data.frame classes

A list in R is a rather loose object made of a collection of other arbitrary
objects known as its components.⁷ For instance, a list can be derived from n
existing objects using the function list:

a=list(name_1=object_1,...,name_n=object_n)

This command creates a list with n arguments using object_1,...,object_n
for the components, each being associated with the argument's name, name_i.
For instance, a$name_1 will be equal to object_1. (It can also be represented
as a[[1]], but this is less practical, as it requires some bookkeeping of the
order of the objects contained in the list.) Lists are very useful in preserving
information about the values of variables used within R functions in the sense
that all relevant values can be put within a list that is the output of the
corresponding function (see Sect. 1.4.5 for details about the construction of
functions in R). Most standard functions in R, for instance eigen in Fig. 1.4,
return a list as their output. Note the use of the abbreviations vec and val
in the last line of Fig. 1.4. Such abbreviations are acceptable as long as they
do not induce confusion. (Using res$v would not work!)
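The behavior of abbreviated names can be sketched on a small example. The 2 × 2 matrix passed to eigen is an arbitrary choice made for this illustration.

```r
# Access to list components by name, by position, and by abbreviated name.
li = list(num=1:5, y="color", a=TRUE)
li$num             # component called num
li[[1]]            # same component, selected by position
li$nu              # unambiguous abbreviation of num
res = eigen(matrix(c(2,1,1,2), nrow=2), symmetric=TRUE)
res$val            # abbreviation of values: eigenvalues 3 and 1
is.null(res$v)     # TRUE: v matches both values and vectors, so nothing is returned
```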

> li=list(num=1:5,y="color",a=T)             create a list with three arguments
> a=matrix(c(6,2,0,2,6,0,0,0,36),nrow=3)     create a (3,3) matrix
> res=eigen(a,symmetric=T)                   diagonalize a and
> names(res)                                 produce a list with two
                                             arguments: vectors and values
> res$vectors                                vectors arguments of res
> diag(res$values)                           create the diagonal matrix
                                             of eigenvalues
> res$vec%*%diag(res$val)%*%t(res$vec)       recover a

Fig. 1.4. Chosen features of the list class

⁷ Lists can contain lists as elements.

The list version of apply is lapply, which computes a function for each
argument of the list

> x = list(a = 1:10, beta = exp(-3:3),
+ logic = c(TRUE,FALSE,FALSE,TRUE))
> lapply(x,mean) #compute the empirical means
$a
[1] 5.5

$beta
[1] 4.535125

$logic
[1] 0.5

provided each argument is of a mode that is compatible with the function
argument (i.e., is numeric in this case). A "user-friendly" version of lapply is
sapply, as in

> sapply(x,mean)
       a     beta    logic
5.500000 4.535125 0.500000

The last class we briefly mention here is the data.frame. A data frame is
a list whose elements are possibly made of differing modes and attributes but
have the same length, as in the example provided in Fig. 1.5. A data frame can
be displayed in matrix form, and its rows and columns can be extracted using
matrix indexing conventions. A list whose components satisfy the restrictions
imposed on a data frame can be coerced into a data frame using the function
as.data.frame. The main purpose of this object is to import data from an
external file by using the read.table function.


> v1=sample(1:12,30,rep=T)             simulate 30 independent uniform
                                       random variables on {1, 2, ..., 12}
> v2=sample(LETTERS[1:10],30,rep=T)    simulate 30 independent uniform
                                       random variables on {a, b, ..., j}
> v3=runif(30)                         simulate 30 independent uniform
                                       random variables on [0, 1]
> v4=rnorm(30)                         simulate 30 independent realizations
                                       from a standard normal distribution
> xx=data.frame(v1,v2,v3,v4)           create a data frame

Fig. 1.5. Definition of a data.frame


1.4.3 Probability Distributions in R

R is primarily a statistical language. It is therefore well-equipped with
probability distributions. As described in Table 1.1, all standard distributions are
available, with a clever programming shortcut: A "core" name, such as norm,
is associated with each distribution and the four basic associated functions,
namely the pdf, the cdf, the quantile function, and the simulation procedure,
are defined by appending the prefixes d, p, q, r to the core name, such as
dnorm(), pnorm(), qnorm(), and rnorm(). Obviously, each function requires
additional entries, as in pnorm(1.96) or rnorm(10,mean=3,sd=3). Recall that
pnorm() and qnorm() are inverses of one another.


Table 1.1. Standard distributions with R core name

Distribution     Core      Parameters        Default values
Beta             beta      shape1, shape2
Binomial         binom     size, prob
Cauchy           cauchy    location, scale   0, 1
Chi-square       chisq     df
Exponential      exp       1/mean            1
Fisher           f         df1, df2
Gamma            gamma     shape, 1/scale    NA, 1
Geometric        geom      prob
Hypergeometric   hyper     m, n, k
Log-Normal       lnorm     mean, sd          0, 1
Logistic         logis     location, scale   0, 1
Normal           norm      mean, sd          0, 1
Poisson          pois      lambda
Student          t         df
Uniform          unif      min, max          0, 1
Weibull          weibull   shape
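The naming convention of Table 1.1 can be illustrated by a few calls, sketched below. The seed and the evaluation points are arbitrary choices, and the comments state values that follow from the definitions of the distributions.

```r
# The d, p, q, r prefixes applied to a few core names of Table 1.1.
p = pnorm(1.96)                 # normal cdf at 1.96, close to 0.975
q = qnorm(p)                    # quantile function: recovers 1.96
d0 = dnorm(0)                   # normal pdf at 0, equal to 1/sqrt(2*pi)
set.seed(42)                    # arbitrary seed, for reproducibility only
y = rnorm(10, mean=3, sd=3)     # ten simulations from a normal distribution
g = dgamma(1, shape=2, rate=1)  # gamma pdf: shape has no default value
pe = pexp(1)                    # exponential cdf with the default rate 1
```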

In addition to these probability functions, R also provides a battery of
(classical) statistical tools, ranging from descriptive statistics to
nonparametric tests and generalized linear models. A description of these abilities is not
possible in this section but we refer the reader to, e.g., Dalgaard (2002) or
Venables and Ripley (2002) for a complete entry.

1.4.4  Graphical  Facilities 

Another clear advantage of using the R language is that it allows a very rich
range of graphical possibilities. Functions such as plot and image can be
customized to a large extent, as described in Venables and Ripley (2002) or
Murrell (2005) (the latter being entirely dedicated to the R graphic abilities).
Even though the default output of plot, as for instance in

> plot(faithful)

is not the most enticing, plot is incredibly flexible: To see the number of
parameters involved, you can type par(), which delivers the default values of
all those parameters.

The wealth of graphical possibilities offered by R should be taken advantage of
cautiously!  That  is,  good  design  avoids  clutter,  small  fonts,  unreadable  scale, 
etc.  The  recommendations  found  in  Tufte  (2001)  are  thus  worth  following  to 
avoid  horrid  outputs  like  those  often  found  in  some  periodicals!  In  addition, 
graphs  produced  by  R  usually  tend  to  look  nicer  on  the  current  device  than 
when  printed  or  included  in  a  slide  presentation.  Colors  may  change,  font  sizes 
may  turn  awkward,  separate  curves  may  end  up  overlapping,  and  so  on. 

Before covering the most standard graphic commands, we start by describing
the notion of device that is at the core of those graphic commands. Each
graphical operation sends its outcome to a device, which can be a graphical
window (like the one that automatically appears when calling a graphical
command for the first time as in the example above) or a file where the graphical
outcome is stored for printing or other uses. Under Unix, Linux and Mac
OS, launching a new graphical window can be done via X11(), with many
possibilities for customization (such as size, positions, color, etc.). Once a
graphical window is created, it is given a device number and can be managed
by functions that start with dev., such as dev.list, dev.set, and others. An
important command is dev.off, which closes the current graphical window.
When the device is a file, it is created by a function that is named after its
driver. There are therefore a postscript, a pdf, a jpeg, and a png function.
When printing to a file, as in the following example,

> pdf(file="faith.pdf")
> par(mfrow=c(1,2),mar=c(4,2,2,1))
> hist(faithful[,1],nclass=21,col="grey",main="",
+ xlab=names(faithful)[1])
> hist(faithful[,2],nclass=21,col="wheat",main="",
+ xlab=names(faithful)[2])
> dev.off()

closing the sequence with dev.off() is compulsory since it completes the file,
which is then saved. If the command pdf(file="faith.pdf") is repeated,
the earlier version of the pdf file is erased.
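The same sequence can be tried without leaving a file behind in the working directory, as in this sketch. Writing to a temporary file through tempfile is a choice made for the illustration, and the name out is hypothetical.

```r
# Open a pdf device, draw two histograms, and close the device to complete the file.
out = tempfile(fileext=".pdf")   # temporary file instead of "faith.pdf"
pdf(file=out)
par(mfrow=c(1,2), mar=c(4,2,2,1))
hist(faithful[,1], nclass=21, col="grey", main="", xlab=names(faithful)[1])
hist(faithful[,2], nclass=21, col="wheat", main="", xlab=names(faithful)[2])
invisible(dev.off())             # closing the device finalizes the file
file.exists(out)                 # TRUE once the device has been closed
```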

Of course, using a line command interface for controlling graphics may
seem antiquated, but this is the consequence of the R object-oriented
philosophy. In addition, current graphs can be saved to a postscript file using the
dev.copy and dev.print functions. Note that R-produced graphs tend to be
large objects, in part because the graphs are not pictures of the current state
but instead preserve every action ever taken.

As already stressed above, plot is a highly versatile tool that can be used
to represent functional curves and two-dimensional datasets. Colors (chosen
by colors() or colours() out of 650 hues), widths, and types can be
calibrated at will and LaTeX-like formulas can be included within the graphs
using expression. Text and legends can be included at a specific point with
locator (see also identify) and legend. An example of (relatively simple)
output is

> plot(as.vector(time(mdeaths)),as.vector(mdeaths),cex=.6,
+ pch=19,xlab="",ylab="Monthly deaths from bronchitis")
> lines(spline(mdeaths),lwd=2,col="red",lty=3)
> ar=arima(mdeaths,order=c(1,0,0))$coef
> lines(as.vector(time(mdeaths))[-1],ar[2]+ar[1]*
+ (mdeaths[-length(mdeaths)]-ar[2]),col="blue",lwd=2,lty=2)
> title("Splines versus AR(1) predictor")
> legend(1974,2800,legend=c("spline","AR(1)"),col=c("red",
+ "blue"),lty=c(3,2),lwd=c(2,2),cex=.5)

represented  in  Fig.  1.6,  which  compares  spline  fitting  to  an  AR(1)  predictor 
and  to  an  SAR(1,12)  predictor.  Note  that  the  seasonal  model  is  doing  worse. 


[Figure: plot titled "Splines versus AR(1) predictor and SAR(1,12) predictor"]

Fig. 1.6. Monthly deaths from bronchitis in the UK over the period 1974-1980
and fits by a spline approximation and an AR predictor

Useful graphical functions include

- hist for constructing and optionally plotting histograms of datasets;
- points for adding points on an existing graph;
- lines for linking points together on an existing graph;
- polygon for filling the area between two sets of points;
- barplot for creating barplots;
- boxplot for creating boxplots.

The two-dimensional representations offered by image and contour are quite
handy when providing likelihood or posterior surfaces. Figure 1.7 gives some
of the most usual graphical commands.


> x=rnorm(100)
> hist(x,nclass=10,prob=T)                    compute and plot a histogram
                                              of x
> curve(dnorm(x),add=T)                       draw the normal density on top
> y=2*x+rnorm(100,0,2)
> plot(x,y,xlim=c(-5,5),ylim=c(-10,10))       draw a scatterplot of x against y
> lines(c(0,0),c(1,2),col="sienna3")
> boxplot(x)                                  compute and plot
                                              a box-and-whiskers plot of x
> state=c("tas","tas","sa","sa","wa","sa")
> statef=factor(state)
> barplot(table(statef))                      draw a bar diagram of statef

Fig. 1.7. Some standard plotting commands


1.4.5  Writing  New  R  Functions 

One of the strengths of R is that new functions and libraries can be created
by anyone and then added to Web depositories to continuously enrich the
language. These new functions are not distinguishable from the core functions
of R, such as median() or var(), because those are also written in R. This
means their code can be accessed and potentially modified, although it is
safer to define new functions. (A few functions are written in C, though, for
efficiency.) Learning how to write functions designed for one's own problems
is paramount for their resolution, even though the huge collection of available
R functions may often contain a function already written for that purpose.

A function is defined in R by an assignment of the form

name=function(arg1[=expr1],arg2[=expr2],...) {
expression
...
expression
value
}

where expression denotes an R command that uses some of the arguments
arg1, arg2, ... to calculate a value, value, that is the outcome of the
function. The braces indicate the beginning and the end of the function and
the brackets some possible default values for the arguments. Note that
producing a value at the end of a function is essential because anything done
within a function is local and temporary, and therefore lost once the function
has been exited, unless saved in value (hence, again, the appeal of list()).
For instance, the following function, named sqrnt, implements a version of
Newton's method for calculating the square root of y:

sqrnt=function(y) {
x=y/2
while (abs(x*x-y) > 1e-10) x=(x+y/x)/2
x
}
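A possible variation on this function, given here as a sketch rather than as the authors' version, adds default values for two arguments and returns a list. The names sqrnt2, tol, and maxit are hypothetical; the iteration cap is an added safeguard, since the absolute tolerance may never be reached for large values of y.

```r
# Newton's method for the square root, with default arguments and a list output.
sqrnt2 = function(y, tol=1e-10, maxit=100) {
  x = y/2                       # starting value, as in sqrnt
  it = 0                        # iteration counter
  while (abs(x*x-y) > tol && it < maxit) {
    x = (x+y/x)/2               # Newton update
    it = it+1
  }
  list(root=x, iterations=it)   # everything to be kept goes into the output
}
res = sqrnt2(2)                 # uses the default tol and maxit
res$root                        # approximation of sqrt(2)
sqrnt2(2, tol=1e-4)$iterations  # a looser tolerance cannot need more iterations
```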

When designing a new R function, it is more convenient to use an external
text editor and to store the function under development in an external file,
say myfunction.R, which can be executed in R as source("myfunction.R").
Note also that some external commands can be launched within an R function
via the very handy command system(). This is, for instance, the easiest way
to incorporate programs written in other languages (e.g., Fortran, C, Matlab)
within R programs.
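As a minimal sketch of this workflow (not code taken from the bayess package), the block below defines a guarded variant of sqrnt, writes it to a temporary file with dump(), and reloads it with source(). The name sqrnt2 and the arguments tol and maxit are illustrative assumptions. The relative tolerance is a design choice made here because the absolute threshold 1e-10 used in sqrnt may be unreachable in floating point when y is very large.

```r
# Illustrative variant of sqrnt: Newton's iteration with a relative
# stopping rule and an iteration cap (names and defaults are assumptions)
sqrnt2 <- function(y, tol = 1e-10, maxit = 100) {
  stopifnot(is.numeric(y), length(y) == 1, y >= 0)
  if (y == 0) return(0)               # avoids the division y/x with x = 0
  x <- y / 2                          # same starting point as sqrnt
  it <- 0
  while (abs(x * x - y) > tol * y && it < maxit) {
    x <- (x + y / x) / 2              # Newton update for x^2 - y = 0
    it <- it + 1
  }
  x
}

# Store the function in an external file, then reload it with source()
f <- tempfile(fileext = ".R")
dump("sqrnt2", file = f)
rm(sqrnt2)
source(f)

sqrnt2(2)      # close to sqrt(2)
sqrnt2(2e20)   # the relative tolerance lets the loop terminate
```

The iteration cap is only a safeguard; with the relative rule, the loop normally stops after a few dozen updates at most.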

Without getting deeply into R programming, let us note a distinction
between global and local variables: the former are defined in the core of the
R code and are recognized everywhere, while the latter are only defined within a
specific function. This means in particular that a local variable, locax say,
initialized within a function, myfunc say, will not be recognized outside myfunc.
(Nor will it be recognized in a function that myfunc merely calls but that is
defined elsewhere. Because R uses lexical scoping, a function defined within
myfunc does have access to locax.)
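The following sketch illustrates how R resolves such a local variable. It reuses the names myfunc and locax from the text, while inner and helper are hypothetical names introduced for the illustration. Under R's lexical scoping, a function defined inside myfunc can read locax, whereas a function defined elsewhere and merely called from myfunc cannot.

```r
helper <- function() exists("locax")   # defined outside myfunc

myfunc <- function() {
  locax <- 3                           # local variable of myfunc
  inner <- function() locax + 1        # defined within myfunc
  list(inner = inner(), helper = helper())
}

res <- myfunc()
res$inner         # 4: inner() finds locax in its enclosing function
res$helper        # FALSE, provided no global object named locax exists
exists("locax")   # FALSE at top level, under the same proviso
```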

The expressions used in a function rely on a syntax that is quite similar to
those of other programming languages, with conditional statements such as

if (expres1) expres2 else expres3

where expres1 is a logical value, and loops such as

for (name in expres1) expres2

and

while (expres1) expres2

where expres1 is a collection of values in the for loop and a logical value in
the while loop, as illustrated in Fig. 1.8. In particular, Boolean operators can
be used within those expressions, including == for testing equality, != for
testing inequality, & for the logical and, | for the logical or, and ! for the
logical negation.

Since R is an interpreted language, avoiding loops is generally a good idea,
but this may render programs much harder to read. It is therefore extremely
useful to include comments within the programs by using the symbol #.

> bool=T;i=0                             separate commands by semicolons
> while(bool==T) {i=i+1; bool=(i<10)}    stop at i = 10
> s=0;x=rnorm(10000)
> system.time(for (i in 1:length(x)){    output sum(x) and
+ s=s+x[i]})[3]                          provide computing time
> system.time(t(rep(1,10000))%*%x)[3]    compare with vector product
> system.time(sum(x))[3]                 compare with sum() efficiency

Fig. 1.8. Some artificial loops in R
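The comparison of Fig. 1.8 can be rerun along the following lines. This is a hedged sketch: the seed is an assumption added for reproducibility, the object names t_loop, t_prod and t_sum are illustrative, and the timings will vary from one machine to another.

```r
set.seed(1)
x <- rnorm(10000)

# Explicit loop accumulating the sum
s <- 0
t_loop <- system.time(for (i in seq_along(x)) s <- s + x[i])[3]

# Same sum written as a vector product, then with the built-in sum()
t_prod <- system.time(p <- drop(t(rep(1, length(x))) %*% x))[3]
t_sum  <- system.time(v <- sum(x))[3]

all.equal(s, v)   # the three approaches agree up to rounding
all.equal(p, v)
c(loop = unname(t_loop), product = unname(t_prod), sum = unname(t_sum))
```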


1.4.6  Input  and  Output  in  R 

Large  data  objects  need  to  be  read  as  values  from  external  files  rather  than 
entered  during  an  R  session  at  the  keyboard  (or  by  cut-and-paste).  Input 
facilities  are  simple,  but  their  requirements  are  fairly  strict.  In  fact,  there  is 
a  clear  presumption  that  it  is  possible  to  modify  input  files  using  other  tools 
outside  R. 

An entire data frame can be read directly with the read.table() function.
Plain files containing rows of values with a single mode can be read
using the scan() function, as in

> a=matrix(scan("myfile"),nrow=5,byrow=T)

When data frames have been produced by another statistical software, the
library foreign can be used to input those frames in R. For example, the
function read.spss() allows one to read SPSS data frames.

Conversely, the generic function save() can be used to store all R objects
in a given file, either in binary or ASCII format. (The alternative function
dump() is more rudimentary but also useful.) The function write.table() is
used to export R data frames as ASCII files.
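A minimal round trip through these functions may look as follows. The data frame and the temporary file names are made up for the illustration; only base R functions are used.

```r
# A small data frame, made up for the illustration
df <- data.frame(id = 1:5, y = c(2.5, 6, -4, 9, 18))

# Export as an ASCII file with write.table(), read back with read.table()
f_txt <- tempfile(fileext = ".txt")
write.table(df, file = f_txt, row.names = FALSE)
df2 <- read.table(f_txt, header = TRUE)

# Plain file of numbers read with scan() and shaped into a matrix
f_num <- tempfile()
write(c(1, 2, 3, 4, 5, 6), file = f_num)
a <- matrix(scan(f_num, quiet = TRUE), nrow = 2, byrow = TRUE)

# Binary storage of several objects with save(), restored with load()
f_bin <- tempfile(fileext = ".RData")
save(df, a, file = f_bin)
rm(df, a)
load(f_bin)

all.equal(df, df2)   # the data frame survives the ASCII round trip
```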

1.4.7  Administration  of  R  Objects 

During an R session, objects are created and stored by name. The command
objects() (or, alternatively, ls()) can be used to display, within a directory
called the workspace, the names of the objects that are currently stored.
Individual objects can be deleted with the function rm(). Removing all objects
created so far is done by rm(list=ls()).

All objects created during an R session (including functions) can be stored
permanently in a file in provision of future R sessions. At the end of each R
session, obtained by the command quit() (which can be abbreviated as q()),
the user is given the opportunity to save all the currently available objects,
as in

> q()
Save workspace image? [y/n/c]:

If the user answers y, the objects created during the current session and those
saved from earlier sessions are saved in a file called .RData and located in the
working directory. When R is called again, it reloads the workspace from this
file, which means that the user starts the new session exactly where the old
one had stopped. In addition, the entire past command history is stored in
the file .Rhistory and can be used in the current or in later sessions by using
the command history().


1.5  The  bayess  Package 

Since  this  is  originally  a  paper  book,  copying  by  hand  the  R  code  represented 
on  the  following  pages  to  your  computer  terminal  would  be  both  tedious  and 
time-wasting.  We  have  therefore  gathered  all  the  programs  and  codes  of  this 
book  within  an  R  package  called  bayess  that  you  should  download  from  CRAN 
before proceeding to the next chapter. Once downloaded on your computer
following the instructions provided on the CRAN Webpage, the package bayess
is loaded into your current R session by library(bayess). All the functions
defined inside the package are then available, and so is a step-by-step
reproduction of the examples provided in the book, using the demo command:

> demo(Chapter.1)

        demo(Chapter.1)

Type <Return> to start :

> # Chapter 1 R commands
>
> # Section 1.4.2
>
> str(log)
function (x, base = exp(1))
> a=c(2,6,-4,9,18)
> x <- c(3,6,9)
> d=a[c(1,3,5)]
> e=3/d

> e=lgamma(e^2)
> S=readline(prompt="Type <Return> to continue : ")
Type <Return> to continue :

and  similarly  for  the  following  chapters.  Obviously,  all  commands  contained 
in  the  demonstrations  and  all  functions  defined  in  the  package  can  be  accessed 
and  modified. 

Although most steps of the demonstrations are short, some may require longer
execution times. If you need to interrupt the demonstration, recall that Ctrl-C
is an interruption command.

This was where the work really took place.

— Ian  Rankin,  Knots  &  Crosses. — 


Roadmap 

This chapter uses the standard normal $\mathcal{N}(\mu,\sigma^2)$ distribution as an easy entry to
generic Bayesian inferential methods. As in every subsequent chapter, we start
with a description of the data used as a chapter benchmark for illustrating new
methods and for testing assimilation of the techniques. We then propose a
corresponding statistical model centered on the normal distribution and consider
specific inferential questions to address at this level, namely parameter
estimation, model choice, and outlier detection, once the Bayesian resolution of
inferential problems has been described. As befits a first chapter, we also
introduce here general computational techniques known as Monte Carlo methods.


J.-M. Marin and C.P. Robert, Bayesian Essentials with R, Springer Texts
in Statistics, DOI 10.1007/978-1-4614-8687-9_2,
© Springer Science+Business Media New York 2014

2.1 Normal Modeling

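The normal (or Gaussian) distribution $\mathcal{N}(\mu,\sigma^2)$, with density on $\mathbb{R}$,

$$f(x|\mu,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\,\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\},$$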

is  certainly  one  of  the  most  studied  and  one  of  the  most  used  distributions 
because  of  its  “normality”:  It  appears  both  as  the  limit  of  additive  small 
effects  and  as  a  representation  of  symmetric  phenomena  without  long  tails, 
and  it  offers  many  openings  in  terms  of  analytical  properties  and  closed-form 
computations.  As  such,  it  is  thus  the  natural  opening  to  a  modeling  course, 
even  more  than  discrete  and  apparently  simpler  models  such  as  the  binomial 
and  Poisson  models  we  will  discuss  in  the  following  chapters.  Note,  however, 
that  we  do  not  advocate  at  this  stage  the  use  of  the  normal  distribution  as 
a  one-fits-all  model:  There  exist  many  continuous  situations  where  a  normal 
model  is  inappropriate  for  many  possible  reasons  (e.g.,  skewness,  fat  tails, 
dependence,  multimodality) . 


Fig. 2.1. Dataset normaldata: Histogram of the observed fringe shifts in
Illingworth's 1927 experiment

Our normal dataset, normaldata, is linked with the famous Michelson-Morley
experiment of 1887 that opened the way to Einstein's relativity theory.
The experiment was intended to detect the "aether flow" and hence
the existence of aether, the theoretical medium that physicists of the time
postulated as necessary to the transmission of light. Michelson's measuring
device measured the difference in the speeds of two light beams
travelling the same distance in two orthogonal directions. As often in physics,
the measurement was done by interferometry and differences in the travelling
time were inferred from shifts in the fringes of the light spectrum. However, the
experiment produced very small measurements that were not conclusive for
the detection of the aether. Later experiments tried to achieve higher
precision, such as the one by Illingworth in 1927 used here as normaldata, only to
obtain smaller and smaller upper bounds on the aether windspeed. While the
original dataset is available in R as morley, the entries are approximated to
the nearest multiple of ten and are therefore difficult to analyze as normal
observations.

The  64  data  points  in  normaldata  are  associated  with  session  numbers 
(first  column),  corresponding  to  different  times  of  the  day,  and  the  values  in  the 
second  column  represent  the  averaged  fringe  displacement  due  to  orientation 
taken  over  ten  measurements  made  by  Illingworth,  who  assumed  a  normal 
error model. Figure 2.1 produces a histogram of the data by the simple R
commands

> data(normaldata)
> shift=normaldata[,2]
> hist(shift,nclass=10,col="steelblue",prob=TRUE,main="")

This histogram seems compatible with a symmetric unimodal distribution
such as the normal distribution. As shown in Fig. 2.2 by a qq-plot obtained
by the commands

> qqnorm((shift-mean(shift))/sd(shift),pch=19,col="gold2")
> abline(a=0,b=1,lty=2,col="indianred",lwd=2)

which compare the empirical cdf with a plug-in normal estimate, the
$\mathcal{N}(\mu,\sigma^2)$ fit may not be perfect, though, because of (a) a possible bimodality
of the histogram and (b) potential outliers.


Fig. 2.2. Dataset normaldata: qq-plot of the observed fringe shifts against the
normal quantiles

As mentioned above, the use of a normal distribution for modeling a given
dataset is a convenient device that does not need to correspond to a perfect fit.
With some degree of approximation, the normal distribution may agree with
the data sufficiently to be used in place of the true distribution (if any). There
exist, however, some setups where the normal distribution is thought to be the
exact distribution behind the dataset (or where departure from normality has
a significance for the theory behind the observations). In Marin and Robert
(2007), we introduced a huge dataset related to the astronomical concept of
the cosmological background noise that illustrated this point, but chose not
to reproduce the set in this edition due to the difficulty in handling it.


2.2  The  Bayesian  Toolkit 

2.2.1  Posterior  Distribution 

Given an independent and identically distributed (later abbreviated as iid)
sample $\mathcal{D}_n=(x_1,\ldots,x_n)$ from a density $f(x|\theta)$, depending upon an
unknown parameter $\theta\in\Theta$, for instance the mean $\mu$ of the benchmark normal
distribution, the associated likelihood function is

$$\ell(\theta|\mathcal{D}_n)=\prod_{i=1}^n f(x_i|\theta). \tag{2.1}$$

This function of $\theta$ is a fundamental entity for the analysis of the information
provided about $\theta$ by the sample $\mathcal{D}_n$, and Bayesian analysis relies on (2.1) to
draw its inference on $\theta$. For instance, when $\mathcal{D}_n$ is a normal $\mathcal{N}(\mu,\sigma^2)$ sample
of size n and $\theta=(\mu,\sigma^2)$, we get

$$\begin{aligned}
\ell(\theta|\mathcal{D}_n) &= \prod_{i=1}^n \exp\{-(x_i-\mu)^2/2\sigma^2\}\big/\sqrt{2\pi}\,\sigma \\
&\propto \exp\Big\{-\sum_{i=1}^n (x_i-\mu)^2/2\sigma^2\Big\}\Big/\sigma^n \\
&\propto \exp\Big\{-\Big(n\mu^2-2n\bar{x}\mu+\sum_{i=1}^n x_i^2\Big)\Big/2\sigma^2\Big\}\Big/\sigma^n \\
&\propto \exp\big\{-\big[n(\mu-\bar{x})^2+s^2\big]\big/2\sigma^2\big\}\big/\sigma^n,
\end{aligned}$$

where $\bar{x}$ denotes the empirical mean and where $s^2$ is the sum $\sum_{i=1}^n (x_i-\bar{x})^2$.

This shows in particular that $\bar{x}$ and $s^2$ are sufficient statistics.


In the above display of equations, the sign $\propto$ means proportional to. This
proportionality is understood for functions of $\theta$, meaning that the discarded
constants do not depend on $\theta$ but may well depend on the data $\mathcal{D}_n$. This
shortcut is both handy in complex Bayesian derivations and fraught with danger
when considering several levels of parameters.
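To make the role of the sufficient statistics concrete, here is a hedged sketch that evaluates the normal log-likelihood once from the raw sample and once from $(n,\bar{x},s^2)$ only. The function name loglik_suff and the simulated sample are assumptions made for the illustration; they are not part of the bayess package.

```r
# Normal log-likelihood written through the sufficient statistics,
# keeping the constant term so that it matches the sum of log densities
loglik_suff <- function(mu, sigma2, n, xbar, s2) {
  -n / 2 * log(2 * pi * sigma2) - (n * (mu - xbar)^2 + s2) / (2 * sigma2)
}

set.seed(42)
x <- rnorm(20, mean = 1, sd = 2)     # simulated stand-in data
n <- length(x)
xbar <- mean(x)
s2 <- sum((x - xbar)^2)              # sum of squares, not the variance

direct <- sum(dnorm(x, mean = 0.5, sd = sqrt(3), log = TRUE))
viasuf <- loglik_suff(mu = 0.5, sigma2 = 3, n = n, xbar = xbar, s2 = s2)
all.equal(direct, viasuf)            # both routes give the same value
```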


The major input of the Bayesian approach, compared with a traditional
likelihood approach, is that it modifies the likelihood function into a posterior
distribution, which is a valid probability distribution on $\Theta$ defined by the
classical Bayes' formula (or theorem)

$$\pi(\theta|\mathcal{D}_n)=\frac{\ell(\theta|\mathcal{D}_n)\,\pi(\theta)}{\int \ell(\theta|\mathcal{D}_n)\,\pi(\theta)\,\mathrm{d}\theta}. \tag{2.2}$$

The factor $\pi(\theta)$ in (2.2) is called the prior and it obviously has to be chosen
to start the analysis.


The posterior density is a probability density on the parameter, which does not
mean the parameter $\theta$ need be a genuine random variable. This density is used
as an inferential tool, not as a truthful representation.


A first motivation for this approach is that the prior distribution
summarizes the prior information on $\theta$; that is, the knowledge that is available
on $\theta$ prior to the observation of the sample $\mathcal{D}_n$. However, the choice of $\pi(\theta)$
is often decided on practical grounds rather than strong subjective beliefs or
overwhelming prior information. A second motivation for the Bayesian
construct is therefore to provide a fully probabilistic framework for the inferential
analysis, with respect to a reference measure $\pi(\theta)$.

As an illustration, consider the simplest case of the normal distribution
with known variance, $\mathcal{N}(\mu,\sigma^2)$. If the prior distribution on $\mu$, $\pi(\mu)$, is the
normal $\mathcal{N}(0,\sigma^2)$, the posterior distribution is easily derived via Bayes'
theorem

$$\begin{aligned}
\pi(\mu|\mathcal{D}_n) &\propto \pi(\mu)\,\ell(\theta|\mathcal{D}_n) \\
&\propto \exp\{-\mu^2/2\sigma^2\}\,\exp\{-n(\bar{x}-\mu)^2/2\sigma^2\} \\
&\propto \exp\{-(n+1)\mu^2/2\sigma^2+2n\mu\bar{x}/2\sigma^2\} \\
&\propto \exp\{-(n+1)[\mu-n\bar{x}/(n+1)]^2/2\sigma^2\},
\end{aligned}$$

which means that this posterior distribution in $\mu$ is a normal distribution
with mean $n\bar{x}/(n+1)$ and variance $\sigma^2/(n+1)$. The mean (and mode) of
the posterior is therefore different from the classical estimator $\bar{x}$, which may
seem a paradoxical feature of this Bayesian analysis. The reason for the
difference is that the prior information that $\mu$ is close enough to zero is taken
into account by the posterior distribution, which thus shrinks the original
estimate towards zero. If we were given the alternative information that $\mu$ was
close to ten, the posterior distribution would similarly shrink $\mu$ towards ten.
The change from a factor n to a factor (n+1) in the (posterior) variance
is similarly explained by the prior information, in that accounting for this
information reduces the variability of our answer.
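The shrinkage effect can be checked numerically. The sketch below assumes a $\mathcal{N}(m,\sigma^2)$ prior with an arbitrary prior mean m, a direct extension of the case m = 0 treated above, for which the same completion of the square gives a posterior mean $(m+n\bar{x})/(n+1)$. The function name post_mean_known_var and the toy sample are hypothetical.

```r
# Posterior of a normal mean with known sigma under a N(prior_mean, sigma^2)
# prior; prior_mean = 0 recovers the case treated in the text
post_mean_known_var <- function(x, sigma, prior_mean = 0) {
  n <- length(x)
  list(mean = (prior_mean + sum(x)) / (n + 1),
       var  = sigma^2 / (n + 1))
}

x <- c(0.2, -0.1, 0.4, 0.1)                      # toy sample with mean 0.15
post_mean_known_var(x, sigma = 0.75)$mean        # 0.12, shrunk towards 0
post_mean_known_var(x, sigma = 0.75, 10)$mean    # 2.12, shrunk towards 10
post_mean_known_var(x, sigma = 0.75)$var         # 0.75^2 / 5 = 0.1125
```

With only four observations the prior counts for one fifth of the total weight, which is why the prior mean of ten moves the estimate so visibly.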

For normaldata, we can first assume that the value of $\sigma$ is the variability
of the Michelson-Morley apparatus, namely 0.75. In that case, the posterior
distribution on the fringe shift average $\mu$ is a normal $\mathcal{N}(n\bar{x}/(n+1),\sigma^2/(n+1))$
distribution, hence with mean and variance

> n=length(shift)
> mmu=sum(shift)/(n+1); mmu
[1] -0.01461538
> vmu=0.75^2/(n+1); vmu
[1] 0.008653846

represented on Fig. 2.3 as a dashed curve.


The case of a normal distribution with a known variance being quite
unrealistic, we now consider the general case of an iid sample $\mathcal{D}_n=(x_1,\ldots,x_n)$
from the normal distribution $\mathcal{N}(\mu,\sigma^2)$ and $\theta=(\mu,\sigma^2)$. Keeping the same
prior distribution $\mathcal{N}(0,\sigma^2)$ on $\mu$, which then appears as a conditional
distribution of $\mu$ given $\sigma^2$, i.e., relying on the generic decomposition

$$\pi(\mu,\sigma^2)=\pi(\mu|\sigma^2)\,\pi(\sigma^2),$$

we have to introduce a further prior distribution on $\sigma^2$. To make computations
simple at this early stage, we choose an exponential $\mathcal{E}(1)$ distribution on
$\sigma^{-2}$. This means that the random variable $\omega=\sigma^{-2}$ is distributed from an
exponential $\mathcal{E}(1)$ distribution, the distribution on $\sigma^2$ being derived by the
usual change of variable technique,

$$\pi(\sigma^2)=\exp(-\sigma^{-2})\left|\frac{\mathrm{d}\sigma^{-2}}{\mathrm{d}\sigma^2}\right|=\exp(-\sigma^{-2})\,(\sigma^2)^{-2}.$$

(This distribution is a special case of an inverse gamma distribution, namely
$\mathcal{IG}(1,1)$.) The corresponding posterior density on $\theta$ is then given by

$$\begin{aligned}
\pi((\mu,\sigma^2)|\mathcal{D}_n) &\propto \pi(\sigma^2)\times\pi(\mu|\sigma^2)\times\ell((\mu,\sigma^2)|\mathcal{D}_n) \\
&\propto (\sigma^{-2})^{1/2+2}\exp\{-(\mu^2+2)/2\sigma^2\} \\
&\qquad\times(\sigma^{-2})^{n/2}\exp\{-(n(\mu-\bar{x})^2+s^2)/2\sigma^2\} \\
&\propto (\sigma^2)^{-(n+5)/2}\exp\{-[(n+1)(\mu-n\bar{x}/(n+1))^2+(2+s^2)]/2\sigma^2\} \\
&\propto (\sigma^2)^{-1/2}\exp\{-(n+1)[\mu-n\bar{x}/(n+1)]^2/2\sigma^2\} \\
&\qquad\times(\sigma^2)^{-(n+2)/2-1}\exp\{-(2+s^2)/2\sigma^2\}.
\end{aligned}$$

Therefore, the posterior on $\theta$ can be decomposed as the product of an inverse
gamma distribution on $\sigma^2$, $\mathcal{IG}((n+2)/2,[2+s^2]/2)$ (which is the
distribution of the inverse of a gamma $\mathcal{G}((n+2)/2,[2+s^2]/2)$ random variable) and,
conditionally on $\sigma^2$, a normal distribution on $\mu$, $\mathcal{N}(n\bar{x}/(n+1),\sigma^2/(n+1))$.
The interpretation of this posterior is quite similar to the case when $\sigma$ is
known, with the difference that the variability in $\sigma$ induces more
variability in $\mu$, the marginal posterior in $\mu$ being then a Student's t distribution1
(Exercise 2.1)

$$\mu|\mathcal{D}_n\sim\mathcal{T}\big(n+2,\;n\bar{x}/(n+1),\;(2+s^2)/(n+1)(n+2)\big)$$

with n + 2 degrees of freedom, a location parameter proportional to $\bar{x}$ and a
scale parameter (almost) proportional to s.
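The link between this two-stage decomposition and the t marginal can be checked by simulation. The sketch below is a generic illustration under assumed hyperparameter values: a and b stand for the shape and rate of the gamma distribution on $\sigma^{-2}$, m for the location, and k for the factor dividing $\sigma^2$ in the conditional normal (so that a = (n+2)/2, b = (2+s^2)/2, k = n+1 in the notation above). The function name r_post is hypothetical.

```r
# Draw (mu, sigma^2) from an inverse gamma x conditional normal structure
r_post <- function(N, a, b, m, k) {
  sigma2 <- 1 / rgamma(N, shape = a, rate = b)      # inverse gamma draws
  mu <- rnorm(N, mean = m, sd = sqrt(sigma2 / k))   # normal given sigma^2
  data.frame(mu = mu, sigma2 = sigma2)
}

set.seed(123)
a <- 5; b <- 2; m <- 0.3; k <- 11                   # arbitrary test values
sim <- r_post(1e5, a, b, m, k)

# Marginal of mu: t with 2a degrees of freedom, location m,
# squared scale b/(a*k)
probs <- c(0.05, 0.5, 0.95)
emp  <- quantile(sim$mu, probs)
theo <- m + sqrt(b / (a * k)) * qt(probs, df = 2 * a)
round(rbind(empirical = emp, theoretical = theo), 3)
```

Setting a = (n+2)/2, b = (2+s^2)/2 and k = n+1 gives 2a = n+2 degrees of freedom and a squared scale (2+s^2)/((n+1)(n+2)), in agreement with the display above.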

For normaldata, an $\mathcal{E}(1)$ prior on $\sigma^{-2}$ being compatible with the value
observed on the Michelson-Morley experiment, the parameters of the t
distribution on $\mu$ are therefore n = 64,

> mtmu=sum(shift)/(n+1);mtmu
[1] -0.01461538
> stmu=(2+(n-1)*var(shift))/((n+2)*(n+1));stmu
[1] 0.0010841496

We compare the resulting posterior with the one based on the assumption
$\sigma=0.75$ on Fig. 2.3, using the curve commands (note that the mnormt
library may require the preliminary installation of the corresponding package
by install.packages("mnormt")):

> library(mnormt)
> curve(dmt(x,mean=mmu,S=stmu,df=n+2),col="chocolate2",lwd=2,
+ xlab="x",ylab="",xlim=c(-.5,.5))
> curve(dnorm(x,mean=mmu,sd=sqrt(vmu)),col="steelblue2",
+ lwd=2,add=TRUE,lty=2)

1 We will omit the reference to Student in the subsequent uses of this distribution,
as is the rule in Anglo-Saxon textbooks.

Fig. 2.3. Dataset normaldata: Two posterior distributions on $\mu$ corresponding
to a hypothetical $\sigma=0.75$ (dashed lines) and to an unknown $\sigma^2$ under the prior
$\sigma^{-2}\sim\mathcal{E}(1)$ (plain lines)


Although this may sound counterintuitive, in this very case, estimating the
variance produces a reduction in the variability of the posterior distribution
on $\mu$. This is because the postulated value of $\sigma^2$ is actually inappropriate for
Illingworth's experiment, being far too large. Since the posterior distribution
on $\sigma^2$ is an $\mathcal{IG}(33,1.82)$ distribution for normaldata, the probability that
$\sigma$ is as large as 0.75 can be evaluated as

> digmma=function(x,shape,scale){dgamma(1/x,shape,scale)/x^2}
> curve(digmma(x,shape=33,scale=(1+(n+1)*var(shift))/2),
+ xlim=c(0,.2),lwd=2)
> pgamma(1/(.75)^2,shape=33,scale=(1+(n+1)*var(shift))/2)
[1] 8.99453e-39

which shows that 0.75 is quite unrealistic, being ten times as large as the
mode of the posterior density on $\sigma^2$.
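When adapting such computations, it helps to keep in mind that the third positional argument of dgamma() and pgamma() in R is the rate, with scale = 1/rate available as a named alternative. The following sketch spells the parametrization out explicitly, assuming that $\mathcal{IG}(a,b)$ denotes the distribution of the inverse of a gamma variable with shape a and rate b, and using the rounded values 33 and 1.82 quoted above. The function name pinvgamma_upper is hypothetical, and the numbers produced here are not meant to reproduce the output printed in the text.

```r
# Upper tail of an inverse gamma: P(sigma^2 >= q) = P(1/sigma^2 <= 1/q)
pinvgamma_upper <- function(q, shape, rate) {
  pgamma(1 / q, shape = shape, rate = rate)
}

a <- 33; b <- 1.82                 # rounded posterior values from the text

# Monte Carlo check of the formula at a moderate threshold
set.seed(11)
sig2 <- 1 / rgamma(1e5, shape = a, rate = b)
c(formula = pinvgamma_upper(0.06, a, b), simulation = mean(sig2 >= 0.06))

# Posterior probability that sigma is at least 0.75 (a tiny number)
p075 <- pinvgamma_upper(0.75^2, a, b)
p075
```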


The above R command library(mnormt) calls the mnormt library, which
contains useful additional functions related to multivariate normal and t
distributions. In particular, dmt allows for location and scale parameters in the
t distribution. Note also that $s^2$ is computed as (n-1)*var(shift) because R
implicitly adopts a classical approach in using the "best unbiased estimator"
of $\sigma^2$.

2.2.2 Bayesian Estimates


A concept that is at the core of Bayesian analysis is that one should provide
an inferential assessment conditional on the realized value of $\mathcal{D}_n$. Bayesian
analysis gives a proper probabilistic meaning to this conditioning by allocating
to $\theta$ a probability distribution. Once the prior distribution is selected, Bayesian
inference formally is "over"; that is, it is completely determined since the
estimation, testing, and evaluation procedures are automatically provided by
the prior and the way procedures are compared (or penalized). For instance,
if estimations $\hat\theta$ of $\theta$ are compared via the sum of squared errors,

$$L(\theta,\hat\theta)=\|\theta-\hat\theta\|^2,$$

the corresponding Bayes optimum is the expected value of $\theta$ under the posterior
distribution,2

$$\hat\theta=\int\theta\,\pi(\theta|\mathcal{D}_n)\,\mathrm{d}\theta=\frac{\int\theta\,\ell(\theta|\mathcal{D}_n)\,\pi(\theta)\,\mathrm{d}\theta}{\int\ell(\theta|\mathcal{D}_n)\,\pi(\theta)\,\mathrm{d}\theta}, \tag{2.3}$$

for a given sample $\mathcal{D}_n$.

When no specific penalty criterion is available, the estimator (2.3) is
often used as a default estimator, although alternatives are also available. For
instance, the maximum a posteriori estimator (MAP) is defined as

$$\hat\theta=\arg\max_\theta\,\pi(\theta|\mathcal{D}_n)=\arg\max_\theta\,\pi(\theta)\,\ell(\theta|\mathcal{D}_n), \tag{2.4}$$

where the function to maximize is usually provided in closed form. However,
numerical problems often make the optimization involved in finding the MAP
far from trivial. Note also here the similarity of (2.4) with the maximum
likelihood estimator (MLE): The influence of the prior distribution $\pi(\theta)$ on
the estimate progressively disappears as the number of observations n
increases, and the MAP estimator often recovers the asymptotic properties of
the MLE.


For normaldata, since the posterior distribution on $\sigma^{-2}$ is a $\mathcal{G}(32,1.82)$
distribution, the posterior expectation of $\sigma^{-2}$ given Illingworth's experimental
data is 32/1.82 = 17.53. The posterior expectation of $\sigma^2$ requires a
supplementary effort in order to derive the mean of an inverse gamma distribution
(see Exercise 2.2), namely

1.82/(33 - 1) = 0.057.

2 Estimators are functions of the data $\mathcal{D}_n$, while estimates are values taken by
those functions. In most cases, we will denote them with a "hat" symbol, the
dependence on $\mathcal{D}_n$ being implicit.

Similarly, the MAP estimate is given here by

$$\arg\max_\theta\,\pi(\sigma^2|\mathcal{D}_n)=1.82/(33+1)=0.054$$

(see also Exercise 2.2). These values therefore reinforce our observation that
the Michelson-Morley precision is not appropriate for the Illingworth
experiment, which is much more precise indeed.
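Both closed-form values can be checked numerically. The sketch below assumes the inverse gamma $\mathcal{IG}(33,1.82)$ posterior on $\sigma^2$ with the rounded parameters quoted above, and compares the formulas b/(a-1) and b/(a+1) with a Monte Carlo average and a numerical maximization. The function name dinvgamma and the search interval passed to optimize() are choices made for the illustration.

```r
a <- 33; b <- 1.82                       # rounded values from the text

# Density of the inverse of a gamma(shape, rate) variable
dinvgamma <- function(x, shape, rate) {
  dgamma(1 / x, shape = shape, rate = rate) / x^2
}

post_mean <- b / (a - 1)                 # posterior expectation of sigma^2
post_mode <- b / (a + 1)                 # MAP estimate of sigma^2

set.seed(7)
mc_mean <- mean(1 / rgamma(1e5, shape = a, rate = b))
num_mode <- optimize(dinvgamma, interval = c(0.01, 0.2),
                     shape = a, rate = b, maximum = TRUE)$maximum

round(c(post_mean, mc_mean), 3)          # both close to 0.057
round(c(post_mode, num_mode), 3)         # both close to 0.054
```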


2.2.3  Conjugate  Prior  Distributions 

The selection of the prior distribution is an important issue in Bayesian
statistics. When prior information is available about the data or the model, it can
(and must) be used in building the prior, and we will see some
implementations of this recommendation in the following chapters. In many situations,
however, the selection of the prior distribution is quite delicate, due to the
absence of reliable prior information, and default solutions must be chosen
instead. Since the choice of the prior distribution has a considerable influence
on the resulting inference, this inferential step must be conducted with the
utmost care.

From a computational viewpoint, the most convenient choice of prior
distributions is to mimic the likelihood structure within the prior. In the most
advantageous cases, priors and posteriors remain within the same
parameterized family. Such priors are called conjugate. While the foundations of
this principle are too advanced to be processed here (see, e.g., Robert, 2007,
Chap. 3), such priors exist for most usual families, including the normal
distribution. Indeed, as seen in Sect. 2.2.1, when the prior on a normal mean is
normal, the corresponding posterior is also normal.

Since conjugate priors are such that the prior and posterior densities
belong to the same parametric family, using the observations boils down to an
update of the parameters of the prior. To avoid confusion, the parameters
involved in the prior distribution on the model parameter are usually called
hyperparameters. (They can themselves be associated with prior distributions,
then called hyperpriors.)

For  most  practical  purposes,  it  is  sufficient  to  consider  the  conjugate  priors 
described  in  Table  2.1.  The  derivation  of  each  row  is  straightforward  if  painful 
and  proceeds  from  the  same  application  of  Bayes’  formula  as  for  the  normal 
case  above  (Exercise  2.5).  For  distributions  that  are  not  within  this  table,  a 
conjugate  prior  may  or  may  not  be  available  (Exercise  2.6). 
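As a hedged illustration of how a conjugate update reduces to arithmetic on the hyperparameters, the sketch below treats the binomial-beta and Poisson-gamma pairs. The function names are hypothetical, the Poisson version is written for an iid sample (a standard extension of the single-observation update), and the proportionality check simply verifies Bayes' formula on a grid of parameter values.

```r
# Beta prior, binomial observation x out of n trials
update_beta_binomial <- function(alpha, beta, x, n) {
  c(alpha = alpha + x, beta = beta + n - x)
}

# Gamma(alpha, rate beta) prior, iid Poisson sample x
update_gamma_poisson <- function(alpha, beta, x) {
  c(alpha = alpha + sum(x), beta = beta + length(x))
}

post <- update_beta_binomial(alpha = 2, beta = 3, x = 7, n = 10)
post                                          # Be(9, 6)

# Likelihood x prior should be proportional to the updated beta density
theta <- seq(0.1, 0.9, by = 0.1)
ratio <- dbinom(7, 10, theta) * dbeta(theta, 2, 3) /
  dbeta(theta, unname(post["alpha"]), unname(post["beta"]))
range(ratio)                                  # constant up to rounding

update_gamma_poisson(alpha = 1, beta = 1, x = c(3, 0, 2))   # G(6, 4)
```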

An important feature of conjugate priors is that one has a priori to select
two hyperparameters, e.g., a mean and a variance in the normal case. On the
one hand, this is an advantage when using a conjugate prior, namely that one
has to select only a few parameters to determine the prior distribution. On
the other hand, this is a drawback in that the information known a priori on $\mu$
may be either insufficient to determine both parameters or incompatible with
the structure imposed by conjugacy.

Table 2.1. Conjugate priors for the most common statistical families

$f(x|\theta)$                                        $\pi(\theta)$                                        $\pi(\theta|x)$
Normal $\mathcal{N}(\theta,\sigma^2)$                Normal $\mathcal{N}(\mu,\tau^2)$                     $\mathcal{N}(\rho(\sigma^2\mu+\tau^2x),\rho\sigma^2\tau^2)$, $\rho^{-1}=\sigma^2+\tau^2$
Poisson $\mathcal{P}(\theta)$                        Gamma $\mathcal{G}(\alpha,\beta)$                    $\mathcal{G}(\alpha+x,\beta+1)$
Gamma $\mathcal{G}(\nu,\theta)$                      Gamma $\mathcal{G}(\alpha,\beta)$                    $\mathcal{G}(\alpha+\nu,\beta+x)$
Binomial $\mathcal{B}(n,\theta)$                     Beta $\mathcal{B}e(\alpha,\beta)$                    $\mathcal{B}e(\alpha+x,\beta+n-x)$
Negative Binomial $\mathcal{N}eg(m,\theta)$          Beta $\mathcal{B}e(\alpha,\beta)$                    $\mathcal{B}e(\alpha+m,\beta+x)$
Multinomial $\mathcal{M}_k(\theta_1,\ldots,\theta_k)$  Dirichlet $\mathcal{D}(\alpha_1,\ldots,\alpha_k)$  $\mathcal{D}(\alpha_1+x_1,\ldots,\alpha_k+x_k)$
Normal $\mathcal{N}(\mu,1/\theta)$                   Gamma $\mathcal{G}(\alpha,\beta)$                    $\mathcal{G}(\alpha+0.5,\beta+(\mu-x)^2/2)$

2.2.4 Noninformative Priors

There is no compelling reason to choose conjugate priors as our priors,
except for their simplicity, but the restrictive aspect of conjugate priors can
be attenuated by using hyperpriors on the hyperparameters themselves,
although we will not deal with this additional level of complexity in the current
chapter. The core message is therefore that conjugate priors are nice to work
with, but require a hyperparameter determination that may prove awkward
in some settings and that may moreover have a lasting impact on the resulting
inference.

Instead of using conjugate priors, one can opt for a completely different
perspective and rely on so-called noninformative priors that aim at
attenuating the impact of the prior on the resulting inference. These priors are
fundamentally defined as coherent extensions of the uniform distribution. Their
purpose is to provide a reference measure that has as little as possible
bearing on the inference (relative to the information brought by the likelihood).
We first warn the reader that, for unbounded parameter spaces, the
densities of noninformative priors actually fail to integrate to a finite number
and they are defined instead as positive measures. While this sounds like an
invalid extension of the probabilistic framework, it is quite correct to
define the corresponding posterior distributions by (2.2), as long as the integral
in the denominator is finite (almost surely). A more detailed account is for
instance provided in Robert (2007, Sect. 1.5) about this possibility of using
$\sigma$-finite measures (sometimes called improper priors) in settings where true
probability prior distributions are too difficult to come by or too subjective
to be accepted by all. For instance, location models

$$x\sim p(x-\theta)$$

are usually associated with flat priors $\pi(\theta)=1$ (note that these models include
the normal $\mathcal{N}(\theta,1)$ as a special case), while scale models

$$x\sim\frac{1}{\theta}\,p\!\left(\frac{x}{\theta}\right)$$

are usually associated with the log-transform of a flat prior, that is,

$$\pi(\theta)=1/\theta.$$

In a more general setting, the (noninformative) prior favored by most
Bayesians is the so-called Jeffreys prior,3 which is related to the Fisher information
matrix

$$I^F(\theta)=\mathrm{var}_\theta\left(\frac{\partial\log f(x|\theta)}{\partial\theta}\right)$$

by

$$\pi^J(\theta)=\left|I^F(\theta)\right|^{1/2},$$

where $|I|$ denotes the determinant of the matrix $I$.

Since the mean $\mu$ of a normal model is a location parameter, when the
variance $\sigma^2$ is known, the standard choice of noninformative prior is an
arbitrary constant $\pi(\mu)$ (taken to be 1 by default). Given that this flat prior
formally corresponds to the limiting case $\tau=\infty$ in the conjugate normal
prior, it is easy to verify that this noninformative prior is associated with the
posterior distribution $\mathcal{N}(x,1)$, which happens to be the likelihood function
in that case. An interesting consequence of this observation is that the MAP
estimator is also the maximum likelihood estimator in that (special) case. For
the general case when $\theta=(\mu,\sigma^2)$, the Fisher information matrix leads to
the Jeffreys prior $\pi^J(\theta)=1/\sigma^3$ (Exercise 2.4). The corresponding posterior
distribution on $(\mu,\sigma^2)$ is then


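$$\begin{aligned}
\pi((\mu,\sigma^2)\mid\mathcal{D}_n) &\propto (\sigma^{-2})^{(3+n)/2}\, \exp\left\{-\left(n(\mu-\bar{x})^2 + s^2\right)/2\sigma^2\right\} \\
&\propto \sigma^{-1} \exp\left\{-n(\mu-\bar{x})^2/2\sigma^2\right\} \times (\sigma^2)^{-(n+2)/2} \exp\left\{-s^2/2\sigma^2\right\},
\end{aligned}$$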


that is,

$$\theta \sim \mathcal{N}(\bar{x}, \sigma^2/n) \times \mathcal{IG}(n/2, s^2/2)\,,$$

a product of a conditional normal on $\mu$ by an inverse gamma on $\sigma^2$. Therefore
the marginal posterior distribution on $\mu$ is a $t$ distribution (Exercise 2.1),

$$\mu \mid \mathcal{D}_n \sim \mathcal{T}(n, \bar{x}, s^2/n^2)\,.$$
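The product structure above suggests a direct way of simulating from the joint posterior, which can be sketched as follows. The data vector is a hypothetical simulated sample (not normaldata), and $s^2$ is taken to be the sum of squared deviations, as in the R code used later in this section. The standardised draws of $\mu$ are compared with the quantiles of a $t$ distribution with $n$ degrees of freedom.

```r
set.seed(7)
x <- rnorm(20, mean = 0.3, sd = 2)      # hypothetical sample
n <- length(x); xbar <- mean(x)
s2 <- sum((x - xbar)^2)                 # sum of squared deviations

N <- 1e5
# sigma2 ~ IG(n/2, s2/2), simulated as the inverse of a gamma variate
sigma2 <- 1 / rgamma(N, shape = n / 2, rate = s2 / 2)
# mu | sigma2 ~ N(xbar, sigma2/n)
mu <- rnorm(N, mean = xbar, sd = sqrt(sigma2 / n))

# standardised draws should behave like a t distribution with n df
tdraws <- (mu - xbar) / sqrt(s2 / n^2)
rbind(simulated = quantile(tdraws, c(.025, .5, .975)),
      exact = qt(c(.025, .5, .975), df = n))
```

Agreement between the two rows (up to Monte Carlo error) is consistent with the $\mathcal{T}(n, \bar{x}, s^2/n^2)$ marginal.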


3Harold Jeffreys was an English geophysicist who developed and formalized
Bayesian methods in the 1930s in order to analyze geophysical data. He ended up
writing an influential treatise on Bayesian statistics entitled Theory of Probability.

For normaldata, the difference in Fig. 2.3 between the noninformative
solution and the conjugate posterior is minor, but it expresses that the prior
distribution $\mathcal{E}(1)$ on $\sigma^{-2}$ is not very appropriate for the Illingworth
experiment, since it does not put enough prior weight on the region of importance,
i.e. near 0.05. As a result, the most concentrated posterior is (seemingly
paradoxically) the one associated with the noninformative prior!


A major (and potentially dangerous) difference between proper and improper
priors is that the posterior distribution associated with an improper prior is not
necessarily defined, that is, it may happen that

$$\int \pi(\theta)\,\ell(\theta\mid\mathcal{D}_n)\,d\theta < \infty \qquad (2.5)$$

does not hold. In some cases, this difficulty disappears when the sample size is
large enough. In others (see Chap. 6), it may remain whatever the sample size.
But the main thing is that, when using improper priors, condition (2.5) must
always be checked.


2.2.5  Bayesian  Credible  Intervals 


One point that must be clear from the beginning is that the Bayesian approach
is a complete inferential approach. Therefore, it covers confidence evaluation,
testing, prediction, model checking, and point estimation. We will progressively
cover the different facets of Bayesian analysis in other chapters of this
book, but we address here the issue of confidence intervals because it is a rather
straightforward step from point estimation.

As with everything else, the derivation of the confidence intervals (or confidence
regions in more general settings) is based on the posterior distribution
$\pi(\theta\mid\mathcal{D}_n)$. Since the Bayesian approach processes $\theta$ as a random variable, a
natural definition of a confidence region on $\theta$ is to determine $C(\mathcal{D}_n)$ such that

$$\pi(\theta \in C(\mathcal{D}_n) \mid \mathcal{D}_n) = 1 - \alpha \qquad (2.6)$$

where $\alpha$ is a predetermined level such as 0.05.4

The important difference with a traditional perspective in (2.6) is that the
integration is done over the parameter space, rather than over the observation
space. The quantity $1 - \alpha$ thus corresponds to the probability that a random
$\theta$ belongs to this set $C(\mathcal{D}_n)$, rather than to the probability that the random
set contains the "true" value of $\theta$. Given this drift in the interpretation of a
confidence set (rather called a credible set by Bayesians), the determination of
the best5 credible set turns out to be easier than in the classical sense: indeed,
this set simply corresponds to the values of $\theta$ with the highest posterior values,

$$C(\mathcal{D}_n) = \{\theta;\ \pi(\theta\mid\mathcal{D}_n) \ge k_\alpha\}\,,$$

where $k_\alpha$ is determined by the coverage constraint (2.6). This region is called
the highest posterior density (HPD) region.

4There is nothing special about 0.05 when compared with, say, 0.87 or 0.12. It
is just that the famous 5 % level is accepted by most as an acceptable level of error.
If the context of the analysis tells a different story, another value for $\alpha$ (including
one that may even depend on the data) should be chosen!


For normaldata, the marginal posterior distribution on $\mu$ associated
with the Jeffreys prior is the $t$ distribution $\mathcal{T}(n, \bar{x}, s^2/n^2)$,

$$\pi(\mu\mid\mathcal{D}_n) \propto \left[n(\mu - \bar{x})^2 + s^2\right]^{-(n+1)/2}$$

with $n = 64$ degrees of freedom. Therefore, due to the symmetry properties
of the $t$ distribution, the 95 % credible interval on $\mu$ is centered at $\bar{x}$ and its
range is derived from the 0.975 quantile of the $t$ distribution with $n$ degrees
of freedom,

> qt(.975,df=n)*sqrt((n-1)*var(shift)/n^2)
[1] 0.05082314

since the mnormt package does not compute quantiles. The resulting confidence
interval is therefore given by

> qt(.975,df=n)*sqrt((n-1)*var(shift)/n^2)+mean(shift)
[1] 0.03597939
> -qt(.975,df=n)*sqrt((n-1)*var(shift)/n^2)+mean(shift)
[1] -0.06566689


i.e. equal to [−0.066, 0.036]. In conclusion, the value 0 belongs to this credible
interval on $\mu$ and this (noninformative) Bayesian analysis of normaldata
shows that, indeed, the absence of aether wind is not contradicted by Illingworth's
experiment.
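The computation above can be wrapped into a small function, sketched here under the same conventions ($s^2$ equal to the sum of squared deviations and a $\mathcal{T}(n, \bar{x}, s^2/n^2)$ marginal posterior). The function name `cred_int` and the four observations are hypothetical and only serve as an illustration.

```r
# 1 - alpha credible interval on mu under the Jeffreys prior,
# based on the T(n, xbar, s2/n^2) marginal posterior
cred_int <- function(x, alpha = 0.05) {
  n <- length(x)
  xbar <- mean(x)
  s2 <- (n - 1) * var(x)                       # sum of squared deviations
  half <- qt(1 - alpha / 2, df = n) * sqrt(s2 / n^2)
  c(lower = xbar - half, upper = xbar + half)
}

x <- c(-1, 0, 1, 2)        # hypothetical observations
ci <- cred_int(x)
ci
cred_int(x, alpha = 0.5)   # a narrower interval at a lower credibility level
```

Because the $t$ posterior is symmetric and unimodal, this equal-tail interval is also the HPD region for $\mu$.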


While the shape of an optimal Bayesian confidence set is easily derived, the
computation of either the bound $k_\alpha$ or the set $C(\mathcal{D}_n)$ may be too challenging
to allow an analytic construction outside conjugate setups (see Exercise 2.11).
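When no analytic construction is available, one possible workaround in dimension one is to evaluate the (possibly unnormalised) posterior density on a fine, equally spaced grid and to keep the grid points with the highest density values until their total mass reaches $1 - \alpha$. The sketch below follows this idea; `hpd_grid` is a hypothetical helper, and the standard normal density is only used as a test case where the answer is known.

```r
# Grid approximation of an HPD region for a univariate posterior density,
# known up to a constant (illustrative helper, hypothetical name);
# the grid is assumed to be equally spaced
hpd_grid <- function(dens, grid, alpha = 0.05) {
  d <- dens(grid)
  w <- d / sum(d)                          # normalised weights on the grid
  o <- order(d, decreasing = TRUE)         # highest density values first
  m <- which(cumsum(w[o]) >= 1 - alpha)[1] # smallest set with enough mass
  list(k = d[o[m]],                        # density bound (same scale as dens)
       range = range(grid[o[seq_len(m)]])) # an interval when dens is unimodal
}

grid <- seq(-6, 6, length.out = 12001)
r <- hpd_grid(dnorm, grid)$range
r                                          # close to -1.96 and 1.96
```

For a multimodal posterior the selected grid points may form several intervals, in which case `range` only returns their convex hull.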


2.3  Bayesian  Model  Choice 

Deciding the validity of some assumptions or restrictions on the parameter
$\theta$ is a major part of the statistician's job. In classical statistics, this type of
problem goes under the name of hypothesis testing, following the framework
set by Fisher, Neyman and Pearson in the 1930s. Hypothesis testing considers
a decision problem where a hypothesis is either true or false and where the
answer provided by the statistician is also a statement on whether or not the
hypothesis is true. However, we deem this approach to be too formalized (even
though it can be directly reproduced from a Bayesian perspective, as shown
in Robert (2007, Chap. 5)) and we strongly favour a model choice philosophy,
namely that two or more models are proposed in parallel and assessed in
terms of their respective fits of the data. This view acknowledges the fact that
models are at best approximations of reality and it does not aim at finding a
"true model", as hypothesis testing may do. In this book, we will thus follow
the latter approach and take the stand that inference problems expressed as
hypothesis testing by the classical statisticians are in fact comparisons of different
models. In terms of numerical outcomes, both perspectives (Bayesian
hypothesis testing vs. Bayesian model choice) are exchangeable but we already
warn the reader that, while the Bayesian solution is formally very close
to a likelihood (ratio) statistic, its numerical values often strongly differ from
the classical solutions.

5In the sense of producing the smallest possible volume with a given coverage.


2.3.1  The  Model  Index  as  a  Parameter 


The essential novelty when dealing with the comparison of models is that this
issue makes the model itself an unknown quantity of interest. Therefore, if we
are comparing two or more models with indices $k = 1, 2, \ldots, J$, we introduce a
model indicator $\mathfrak{M}$ taking values in $\{1, 2, \ldots, J\}$ and representing the index of
the "true" model. If $\mathfrak{M} = k$, then the data $\mathcal{D}_n$ are generated from a statistical
model $M_k$ with likelihood $\ell_k(\theta_k\mid\mathcal{D}_n)$ and parameter $\theta_k$ taking its value in a
parameter space $\Theta_k$. An obvious illustration is when opposing two standard
parametric families, e.g., a normal family against a $t$ family, in which case
$J = 2$, $\Theta_1 = \mathbb{R} \times \mathbb{R}_+^*$ (for mean and variance) and $\Theta_2 = \mathbb{R}_+^* \times \mathbb{R} \times \mathbb{R}_+^*$ (for
degree of freedom, mean and variance), but this framework also includes soft
or hard constraints on the parameters, as for instance imposing that a mean
$\mu$ is positive.

In this setting, a natural Bayes procedure associated with a prior distribution
$\pi$ is to consider the posterior probability

$$\mathbb{P}^\pi(\mathfrak{M} = k \mid \mathcal{D}_n)\,,$$

i.e., the posterior probability that the model index is $k$, and select the index
of the model with the highest posterior probability as the model preferred
by the data $\mathcal{D}_n$. This representation implies that the prior $\pi$ is defined over
the collection of model indices, $\{1, 2, \ldots, J\}$, and, conditionally on the model
index $\mathfrak{M}$, on the corresponding parameter space, $\Theta_k$. This construction may
sound both artificial and incomplete, as there is no prior on the parameter $\theta_k$
unless $\mathfrak{M} = k$, but it nonetheless perfectly translates the problem at hand:




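inference on $\theta_k$ is meaningless unless this is the parameter of the correct model.
Furthermore, the quantity of interest integrates out the parameter, since

$$\mathbb{P}^\pi(\mathfrak{M} = k \mid \mathcal{D}_n) = \frac{\mathbb{P}^\pi(\mathfrak{M} = k) \int \ell_k(\theta_k\mid\mathcal{D}_n)\,\pi_k(\theta_k)\,d\theta_k}{\sum_{j=1}^{J} \mathbb{P}^\pi(\mathfrak{M} = j) \int \ell_j(\theta_j\mid\mathcal{D}_n)\,\pi_j(\theta_j)\,d\theta_j}\,.$$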
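Once the integrated likelihoods of the competing models are available, the posterior model probabilities follow from a simple normalisation. A minimal sketch, assuming the marginals are supplied on the log scale (the function name `post_model_prob` and its arguments are illustrative, not taken from the book's code):

```r
# Posterior model probabilities from log marginal likelihoods and prior weights
post_model_prob <- function(logmarg,
                            prior = rep(1 / length(logmarg), length(logmarg))) {
  lw <- log(prior) + logmarg
  w <- exp(lw - max(lw))   # subtract the maximum to avoid numerical underflow
  w / sum(w)
}

p_equal <- post_model_prob(log(c(1, 3)))                  # equal prior weights
p_tilt  <- post_model_prob(log(c(1, 3)), c(0.75, 0.25))   # prior favouring model 1
p_equal
p_tilt
```

Working on the log scale is a design choice: integrated likelihoods of even moderately large samples are often too small to be represented directly.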


We believe it is worth emphasizing the above point: A parameter $\theta_k$ associated
with a model does not have a statistical meaning outside this model. This means
in particular that the notion of parameters "common to all models" often found
in the literature, including the Bayesian literature, is not acceptable within a
model choice perspective. Two models must have distinct parameters, if only
because the purpose of the analysis is to end up with a single model.


The choice of the prior $\pi$ is highly dependent on the value of the prior
model probabilities $\mathbb{P}^\pi(\mathfrak{M} = k)$. In some cases, there is experimental or
subjective evidence about those probabilities, but in others, we are forced to settle
for equal weights $\mathbb{P}^\pi(\mathfrak{M} = k) = 1/J$. For instance, given a single observation
$x \sim \mathcal{N}(\mu, \sigma^2)$ from a normal model where $\sigma^2$ is known, assuming $\mu \sim \mathcal{N}(\xi, \tau^2)$,
the posterior distribution $\pi(\mu\mid x)$ is the normal distribution $\mathcal{N}(\xi(x), \omega^2)$ with

$$\xi(x) = \frac{\sigma^2\xi + \tau^2 x}{\sigma^2 + \tau^2} \qquad\text{and}\qquad \omega^2 = \frac{\sigma^2\tau^2}{\sigma^2 + \tau^2}\,.$$

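If the question of interest is to decide whether $\mu$ is negative or positive, we
can directly compute

$$\mathbb{P}^\pi(\mu < 0 \mid x) = \mathbb{P}^\pi\!\left(\frac{\mu - \xi(x)}{\omega} < \frac{-\xi(x)}{\omega}\,\Big|\, x\right) = \Phi\left(-\xi(x)/\omega\right), \qquad (2.7)$$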


where $\Phi$ is the normal cdf. This computation does not seem to follow from
the principles we just stated but it is only a matter of perspective as we
can derive the priors on both models from the original prior. Deriving this
posterior probability indeed means that, a priori, $\mu$ is negative with probability
$\mathbb{P}^\pi(\mu < 0) = \Phi(-\xi/\tau)$ and that, in this model, the prior on $\mu$ is the truncated
normal

$$\pi_1(\mu) = \frac{\exp\{-(\mu - \xi)^2/2\tau^2\}}{\sqrt{2\pi}\,\tau\,\Phi(-\xi/\tau)}\; \mathbb{I}_{\mu < 0}\,,$$

while $\mu$ is positive with probability $\Phi(\xi/\tau)$ and, in this second model, the prior
on $\mu$ is the truncated normal

$$\pi_2(\mu) = \frac{\exp\{-(\mu - \xi)^2/2\tau^2\}}{\sqrt{2\pi}\,\tau\,\Phi(\xi/\tau)}\; \mathbb{I}_{\mu > 0}\,.$$
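The equivalence between the direct computation (2.7) and the model choice formulation with truncated normal priors can be checked numerically. The following sketch uses hypothetical values for $x$, $\sigma$, $\xi$ and $\tau$, and relies on `integrate` to evaluate the marginal likelihoods under the two truncated priors.

```r
# hypothetical values for the observation and the hyperparameters
x <- 0.8; sigma <- 1; xi <- 0.5; tau <- 2

# direct route: posterior N(xi(x), omega^2) and formula (2.7)
xi_x  <- (sigma^2 * xi + tau^2 * x) / (sigma^2 + tau^2)
omega <- sqrt(sigma^2 * tau^2 / (sigma^2 + tau^2))
direct <- pnorm(-xi_x / omega)

# model choice route: prior weights and truncated normal priors
p1 <- pnorm(-xi / tau)                 # prior probability that mu < 0
joint <- function(mu) dnorm(x, mu, sigma) * dnorm(mu, xi, tau)
m1 <- integrate(joint, -Inf, 0, rel.tol = 1e-8)$value / p1        # marginal under pi_1
m2 <- integrate(joint, 0, Inf, rel.tol = 1e-8)$value / (1 - p1)   # marginal under pi_2
via_models <- p1 * m1 / (p1 * m1 + (1 - p1) * m2)

c(direct = direct, via_models = via_models)
```

Both routes should return the same posterior probability, up to the numerical integration error.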


The posterior probability $\mathbb{P}^\pi(\mathfrak{M} = k\mid\mathcal{D}_n)$ is the core object in Bayesian
model choice and, as indicated above, the default procedure is to select the
model with the highest posterior probability. However, in decisional settings
where the choice between two models has different consequences depending on
the value of $k$, the boundary in $\mathbb{P}^\pi(\mathfrak{M} = k\mid\mathcal{D}_n)$ between choosing one model
and the other may be far from 0.5. For instance, in a pharmaceutical trial,
deciding to start production of a new drug does not have the same financial
impact as deciding to run more preliminary tests. Changing the bound away
from 0.5 is in fact equivalent to changing the prior probabilities of both models.


2.3.2  The  Bayes  Factor 


A notion central to Bayesian model choice is the Bayes factor

$$B^\pi_{21}(\mathcal{D}_n) = \frac{\mathbb{P}^\pi(\mathfrak{M} = 2\mid\mathcal{D}_n)\big/\mathbb{P}^\pi(\mathfrak{M} = 1\mid\mathcal{D}_n)}{\mathbb{P}^\pi(\mathfrak{M} = 2)\big/\mathbb{P}^\pi(\mathfrak{M} = 1)}\,,$$

which corresponds to the classical odds or likelihood ratio, the difference being
that the parameters are integrated rather than maximized under each
model. While this quantity is a simple one-to-one transform of the posterior
probability, it can be used for Bayesian model choice without first resorting
to a determination of the prior weights of both models. Obviously, the Bayes
factor depends on prior information through the choice of the model priors $\pi_1$
and $\pi_2$,

$$B^\pi_{21}(\mathcal{D}_n) = \frac{\int_{\Theta_2} \ell_2(\theta_2\mid\mathcal{D}_n)\,\pi_2(\theta_2)\,d\theta_2}{\int_{\Theta_1} \ell_1(\theta_1\mid\mathcal{D}_n)\,\pi_1(\theta_1)\,d\theta_1} = \frac{m_2(\mathcal{D}_n)}{m_1(\mathcal{D}_n)}\,,$$

and thus it can clearly be perceived as a Bayesian likelihood ratio which
replaces the likelihoods with the marginals under both models.

The evidence brought by the data $\mathcal{D}_n$ can be calibrated using for instance
Jeffreys' scale of evidence:

if $\log_{10}(B^\pi_{21})$ is between 0 and 0.5, the evidence against model $M_1$
is weak,
if it is between 0.5 and 1, it is substantial,
if it is between 1 and 2, it is strong, and
if it is above 2, it is decisive.

While this scale is purely arbitrary, it provides a reference for model assessment
in a generic setting.
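The scale above translates into a simple lookup, sketched below. The function name `jeffreys_scale`, the label returned when the Bayes factor is below one, and the assignment of the boundary values 0.5, 1 and 2 to the lower category are all illustrative conventions rather than part of Jeffreys' scale itself.

```r
# Map a Bayes factor B21 onto Jeffreys' scale of evidence against model M1
jeffreys_scale <- function(B21) {
  l10 <- log10(B21)
  if (l10 < 0) return("evidence favours M1")
  as.character(cut(l10, breaks = c(0, 0.5, 1, 2, Inf),
                   labels = c("weak", "substantial", "strong", "decisive"),
                   include.lowest = TRUE))
}

grades <- sapply(c(0.5, 2, 5, 50, 500), jeffreys_scale)
grades
```

For instance, a Bayes factor of 5 has a base-10 logarithm of about 0.7 and is therefore graded as substantial evidence against $M_1$.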

Consider now the special case when we want to assess whether or not a
specific value of one of the parameters is appropriate, for instance $\mu = 0$ in the
normaldata example. While the classical literature presents this problem as
a point null hypothesis, we simply interpret it as the comparison of two models,
$\mathcal{N}(0, \sigma^2)$ and $\mathcal{N}(\mu, \sigma^2)$, for Illingworth's data. In a more general framework,
when the sample $\mathcal{D}_n$ is distributed as $\mathcal{D}_n \sim f(\mathcal{D}_n\mid\theta)$, if we decompose $\theta$ as
$\theta = (\delta, \omega)$ and if the restricted model corresponds to the fixed value $\delta = \delta_0$, we
define $\pi_1(\omega)$ as the prior under the restricted model (labelled $M_1$) and $\pi_2(\theta)$
as the prior under the unrestricted model (labelled $M_2$). The corresponding
Bayes factor is then

$$B^\pi_{21}(\mathcal{D}_n) = \frac{\int_\Theta \ell(\theta\mid\mathcal{D}_n)\,\pi_2(\theta)\,d\theta}{\int_\Omega \ell((\delta_0, \omega)\mid\mathcal{D}_n)\,\pi_1(\omega)\,d\omega}\,.$$

Note that, as hypotheses, point null problems often are criticized as artificial
and impossible to test (in the sense of how often can one distinguish
$\theta = 0$ from $\theta = 0.0001$?!), but, from a model choice perspective, they simply
correspond to more parsimonious models whose fit to the data can be checked
against the fit produced by an unconstrained model. While the unconstrained
model obviously contains values that produce a better fit, averaging over the
whole parameter space $\Theta$ may still result in a small integrated likelihood
$m_2(\mathcal{D}_n)$. The Bayes factor thus contains an automated penalization for complexity,
a feature missed by the classical likelihood ratio statistic.

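In the very special case when the whole parameter is constrained to a fixed
value, $\theta = \theta_0$, the marginal likelihood under model $M_1$ coincides with the
likelihood $\ell(\theta_0\mid\mathcal{D}_n) = f(\mathcal{D}_n\mid\theta_0)$ and the Bayes factor simplifies into

$$B^\pi_{21}(\mathcal{D}_n) = \frac{\int_\Theta f(\mathcal{D}_n\mid\theta)\,\pi_2(\theta)\,d\theta}{f(\mathcal{D}_n\mid\theta_0)}\,.$$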

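For $x \sim \mathcal{N}(\mu, \sigma^2)$ and $\sigma^2$ known, consider assessing $\mu = 0$ when $\mu \sim
\mathcal{N}(0, \tau^2)$ under the alternative model (labelled $M_2$). The Bayes factor is the
ratio

$$B^\pi_{21}(\mathcal{D}_n) = \frac{m_2(x)}{f(x\mid(0, \sigma^2))} = \frac{\sigma}{\sqrt{\sigma^2 + \tau^2}}\; \frac{e^{-x^2/2(\sigma^2+\tau^2)}}{e^{-x^2/2\sigma^2}} = \sqrt{\frac{\sigma^2}{\sigma^2 + \tau^2}}\, \exp\left\{\frac{\tau^2 x^2}{2\sigma^2(\sigma^2 + \tau^2)}\right\}.$$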


Table 2.2 gives a sample of the values of the Bayes factor when the normalized
quantity $x/\sigma$ varies. They obviously depend on the choice of the prior variance
$\tau^2$ and the dependence is actually quite severe, as we will see below with the
Jeffreys-Lindley paradox.

Table 2.2. Bayes factor $B^\pi_{21}(z)$ against the null hypothesis $\mu = 0$ for different
values of $z = x/\sigma$ and $\tau$

z                        0       0.68    1.28    1.96
$\tau^2 = \sigma^2$      0.707   0.794   1.065   1.847
$\tau^2 = 10\sigma^2$    0.302   0.372   0.635   1.728

For normaldata, since we saw that setting $\sigma$ to the Michelson-Morley
value of 0.75 was producing a poor outcome compared with the noninformative
solution, the comparison between the constrained and the unconstrained
models is not very trustworthy, but as an illustration, it gives the following
values:

> BaFa=function(z,rat){
#rat denotes the ratio tau^2/sigma^2
sqrt(1/(1+rat))*exp(z^2/(2*(1+1/rat)))}
> BaFa(mean(shift),1)
[1] 0.7071767
> BaFa(mean(shift),10)
[1] 0.3015650

which supports the constraint $\mu = 0$ for those two values of $\tau$, since the Bayes
factor is less than 1. (For this dataset, the Bayes factor is always less than
one, see Exercise 2.13.)
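The entries of Table 2.2 can be recomputed from the closed-form expression of the Bayes factor, as in the sketch below. The function is renamed `bf21` here (a hypothetical name, chosen so as not to overwrite `BaFa`), and the last lines check the closed form against a numerical evaluation of the marginal $m_2(x)$ for $\sigma = 1$ and $\tau^2 = 10$.

```r
# closed-form Bayes factor as a function of z = x/sigma and rat = tau^2/sigma^2
bf21 <- function(z, rat) sqrt(1 / (1 + rat)) * exp(z^2 / (2 * (1 + 1 / rat)))

z <- c(0, 0.68, 1.28, 1.96)
tab <- rbind("tau^2 = sigma^2" = bf21(z, 1),
             "tau^2 = 10 sigma^2" = bf21(z, 10))
colnames(tab) <- z
round(tab, 3)

# numerical check of the marginal m2(x) behind the closed form (sigma = 1)
x <- 1.28
m2 <- integrate(function(mu) dnorm(x, mean = mu) * dnorm(mu, sd = sqrt(10)),
                -Inf, Inf, rel.tol = 1e-8)$value
c(numerical = m2 / dnorm(x), closed_form = bf21(x, 10))
```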


2.3.3  The  Ban  on  Improper  Priors 


We introduced noninformative priors in Sect. 2.2.4 as a way to handle situations
when the prior information was not sufficient to build proper priors.
We also saw that, for normaldata, a noninformative prior was able to exhibit
conflicts between the prior information (based on the Michelson-Morley
experiment) and the data (resulting from Illingworth's experiment). Unfortunately,
the use of noninformative priors is very much restricted in model
choice settings because the fact that they usually are improper leads to the
impossibility of comparing the resulting marginal likelihoods.

Looking at the expression of the Bayes factor,

$$B^\pi_{21}(\mathcal{D}_n) = \frac{\int_{\Theta_2} \ell_2(\theta_2\mid\mathcal{D}_n)\,\pi_2(\theta_2)\,d\theta_2}{\int_{\Theta_1} \ell_1(\theta_1\mid\mathcal{D}_n)\,\pi_1(\theta_1)\,d\theta_1}\,,$$

it is clear that, when either $\pi_1$ or $\pi_2$ is improper, it is impossible to normalize
the improper measures in a unique manner. Therefore, the Bayes factor becomes
completely arbitrary since it can be multiplied by one or two arbitrary
constants.

For instance, when comparing $x \sim \mathcal{N}(\mu, 1)$ (model $M_1$) with $x \sim \mathcal{N}(0, 1)$
(model $M_2$), the improper Jeffreys prior on model $M_1$ is $\pi_1(\mu) = 1$. The Bayes
factor corresponding to this choice is

$$B^\pi_{21}(x) = \frac{e^{-x^2/2}}{\int_{-\infty}^{+\infty} e^{-(x-\theta)^2/2}\,d\theta} = \frac{e^{-x^2/2}}{\sqrt{2\pi}}\,.$$

If, instead, we use the prior $\pi_1(\mu) = 100$, the Bayes factor becomes

$$B^\pi_{21}(x) = \frac{e^{-x^2/2}}{100\int_{-\infty}^{+\infty} e^{-(x-\theta)^2/2}\,d\theta} = \frac{e^{-x^2/2}}{100\sqrt{2\pi}}$$

and is thus one-hundredth of the previous value! Since there is no mathematical
way to discriminate between $\pi_1(\mu) = 1$ and $\pi_1(\mu) = 100$, the answer
clearly is non-sensical.
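The arbitrariness is easy to reproduce numerically. In the sketch below, `bf_flat` is a hypothetical helper that computes the ratio of the $\mathcal{N}(0, 1)$ density at $x$ to the integrated likelihood of the $\mathcal{N}(\theta, 1)$ model under a constant prior equal to `const`; the observation $x = 1$ is an arbitrary choice.

```r
# Bayes factor of N(0,1) against N(theta,1) under the flat prior pi_1 = const
bf_flat <- function(x, const = 1) {
  m1 <- integrate(function(theta) const * dnorm(x, mean = theta),
                  -Inf, Inf, rel.tol = 1e-8)$value
  dnorm(x) / m1
}

b1   <- bf_flat(1, const = 1)     # equals exp(-1/2) / sqrt(2 * pi)
b100 <- bf_flat(1, const = 100)   # one hundredth of the previous value
c(b1, b100, b1 / b100)
```

The data and both models are unchanged between the two calls; only the arbitrary constant differs.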

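Note that, if we are instead comparing model $M_1$ where $\mu < 0$ and model
$M_2$ where $\mu > 0$, then the posterior probability of model $M_1$ under the flat
prior is

$$\mathbb{P}^\pi(\mu < 0\mid x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{0} e^{-(x-\theta)^2/2}\,d\theta = \Phi(-x)\,,$$

which is uniquely defined.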

The difficulty in using an improper prior also relates to what is called the
Jeffreys-Lindley paradox, a phenomenon that shows that limiting arguments
are not valid in testing settings. In contrast with estimation settings, the
noninformative prior no longer corresponds to the limit of conjugate inferences.
For instance, for the comparison of the normal $x \sim \mathcal{N}(0, \sigma^2)$ (model $M_1$) and
of the normal $x \sim \mathcal{N}(\mu, \sigma^2)$ (model $M_2$) models when $\sigma^2$ is known, using a
conjugate prior $\mu \sim \mathcal{N}(0, \tau^2)$, the Bayes factor

$$B^\pi_{21}(x) = \sqrt{\frac{\sigma^2}{\sigma^2 + \tau^2}}\, \exp\left\{\frac{\tau^2 x^2}{2\sigma^2(\sigma^2 + \tau^2)}\right\}$$

converges to 0 when $\tau$ goes to $+\infty$, for every value of $x$, again a non-sensical
procedure.
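The convergence to 0 can be observed directly on the closed-form expression. The sketch below (with the hypothetical name `bf21_conj`) evaluates the Bayes factor at $x = 3$ and $\sigma = 1$, an observation three standard deviations away from the null value, for increasingly diffuse priors.

```r
# Bayes factor of N(mu, sigma^2), mu ~ N(0, tau^2), against N(0, sigma^2)
bf21_conj <- function(x, sigma, tau) {
  sqrt(sigma^2 / (sigma^2 + tau^2)) *
    exp(tau^2 * x^2 / (2 * sigma^2 * (sigma^2 + tau^2)))
}

tau <- 10^(0:6)
vals <- bf21_conj(x = 3, sigma = 1, tau = tau)
signif(vals, 3)
```

For moderate values of $\tau$ the Bayes factor favours the unconstrained model, but it eventually decreases towards 0 as $\tau$ grows, whatever the observation.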

Since improper priors are an essential part of the Bayesian approach, there
are many proposals found in the literature to overcome this ban. Most of
those proposals rely on a device that transforms the improper prior into a
proper probability distribution by exploiting a fraction of the data $\mathcal{D}_n$ and
then restricts itself to the remaining part of the data to run the test as in
a standard situation. The variety of available solutions is due to the many
possibilities of removing the dependence on the choice of the portion of the
data used in the first step. The resulting procedures are called pseudo-Bayes
factors, although some may actually correspond to true Bayes factors. See
Robert (2007, Chap. 5) for more details, although we do not advocate using
those procedures.

There is a major exception to this ban on improper priors that we can
exploit. If both models under comparison have parameters that have similar
enough meanings to share the same prior distribution, as for instance a
measurement error $\sigma^2$, then the normalization issue vanishes. Note that we are
not assuming that parameters are common to both models and thus that we
do not contradict the earlier warning about distinct parameters for distinct
models. An illustration is provided by the above remark on the comparison
of $\mu < 0$ with $\mu > 0$. This partial opening in the use of improper priors
represents an opportunity but it does not apply to parameters of interest, i.e. to
parameters on which restrictions are assessed.


Example 2.1. When comparing two iid normal samples, $(x_1, \ldots, x_n)$ and
$(y_1, \ldots, y_n)$, with respective distributions $\mathcal{N}(\mu_x, \sigma^2)$ and $\mathcal{N}(\mu_y, \sigma^2)$, we can
examine whether or not the two means are identical, i.e. $\mu_x = \mu_y$ (corresponding
to model $M_1$). To take advantage of the structure of this model, we can assume
that $\sigma^2$ is a measurement error with a similar meaning under both models and
thus that the same prior $\pi_\sigma(\sigma^2)$ can be used under both models. This means
that the Bayes factor

$$B^\pi_{21}(\mathcal{D}_n) = \frac{\int \ell_2(\mu_x, \mu_y, \sigma\mid\mathcal{D}_n)\,\pi(\mu_x, \mu_y)\,\pi_\sigma(\sigma^2)\,d\sigma^2\,d\mu_x\,d\mu_y}{\int \ell_1(\mu, \sigma\mid\mathcal{D}_n)\,\pi_\mu(\mu)\,\pi_\sigma(\sigma^2)\,d\sigma^2\,d\mu}$$

does not depend on the normalizing constant used for $\pi_\sigma(\sigma^2)$ and thus that
we can still use an improper prior such as $\pi_\sigma(\sigma^2) = 1/\sigma^2$ in that case.
Furthermore, we can rewrite $\mu_x$ and $\mu_y$ as $\mu_x = \mu - \xi$ and $\mu_y = \mu + \xi$, respectively,
and use a prior of the form $\pi(\mu, \xi) = \pi_\mu(\mu)\pi_\xi(\xi)$ on the new parameterization
so that, again, the same prior $\pi_\mu$ can be used under both models. The same
cancellation of the normalizing constant occurs for $\pi_\mu$, which means a Jeffreys
prior $\pi_\mu(\mu) = 1$ can be used. However, we need a proper and well-defined prior
on $\xi$, for instance $\xi \sim \mathcal{N}(0, \tau^2)$, which leads to


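$$\begin{aligned}
B^\pi_{21}(\mathcal{D}_n) &= \frac{\int e^{-n\left[(\mu-\xi-\bar{x})^2 + (\mu+\xi-\bar{y})^2 + s^2_{xy}\right]/2\sigma^2}\, \sigma^{-2n-2}\, e^{-\xi^2/2\tau^2}\big/\tau\sqrt{2\pi}\; d\sigma^2\, d\mu\, d\xi}{\int e^{-n\left[(\mu-\bar{x})^2 + (\mu-\bar{y})^2 + s^2_{xy}\right]/2\sigma^2}\, \sigma^{-2n-2}\, d\sigma^2\, d\mu} \\
&= \frac{\int \left[(\mu-\xi-\bar{x})^2 + (\mu+\xi-\bar{y})^2 + s^2_{xy}\right]^{-n} e^{-\xi^2/2\tau^2}\big/\tau\sqrt{2\pi}\; d\mu\, d\xi}{\int \left[(\mu-\bar{x})^2 + (\mu-\bar{y})^2 + s^2_{xy}\right]^{-n} d\mu}\,,
\end{aligned}$$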


where $s^2_{xy}$ denotes the average

$$s^2_{xy} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 + \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2\,.$$

While the denominator can be completely integrated out, the numerator cannot.
A numerical approximation to $B^\pi_{21}$ is thus necessary. (This issue is
addressed in Sect. 2.4.) ◄

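We conclude this section by a full processing of the assessment of $\mu = 0$
for the single sample normal problem. Comparing models $M_1 : \mathcal{N}(0, \sigma^2)$
under the prior $\pi_1(\sigma^2) = 1/\sigma^2$ and $M_2 : \mathcal{N}(\mu, \sigma^2)$ under the prior made of
$\pi_2(\sigma^2) = 1/\sigma^2$ and $\pi_2(\mu\mid\sigma^2)$ equal to the normal $\mathcal{N}(0, \sigma^2)$ density, the Bayes
factor is

$$\begin{aligned}
B^\pi_{21}(\mathcal{D}_n) &= \frac{\int\!\!\int e^{-[n(\bar{x}-\mu)^2+s^2]/2\sigma^2}\, e^{-\mu^2/2\sigma^2}\, \sigma^{-n-1-2}\, d\mu\, d\sigma^2 \big/ \sqrt{2\pi}}{\int e^{-[n\bar{x}^2+s^2]/2\sigma^2}\, \sigma^{-n-2}\, d\sigma^2} \\
&= \frac{\int\!\!\int e^{-(n+1)[\mu - n\bar{x}/(n+1)]^2/2\sigma^2}\, e^{-[n\bar{x}^2/(n+1)+s^2]/2\sigma^2}\, \sigma^{-n-3}\, d\mu\, d\sigma^2 \big/ \sqrt{2\pi}}{\left[\dfrac{n\bar{x}^2+s^2}{2}\right]^{-n/2}\Gamma(n/2)} \\
&= \frac{(n+1)^{-1/2}\int e^{-[n\bar{x}^2/(n+1)+s^2]/2\sigma^2}\, \sigma^{-n-2}\, d\sigma^2}{\left[\dfrac{n\bar{x}^2+s^2}{2}\right]^{-n/2}\Gamma(n/2)} \\
&= \frac{(n+1)^{-1/2}\left[\dfrac{n\bar{x}^2/(n+1)+s^2}{2}\right]^{-n/2}\Gamma(n/2)}{\left[\dfrac{n\bar{x}^2+s^2}{2}\right]^{-n/2}\Gamma(n/2)} \\
&= (n+1)^{-1/2}\left[\frac{n\bar{x}^2+s^2}{n\bar{x}^2/(n+1)+s^2}\right]^{n/2},
\end{aligned}$$

taking once again advantage of the normalizing constant of the gamma distribution
(see also Exercise 2.8). It therefore increases with $\bar{x}^2/s^2$, starting
from $1/\sqrt{n+1}$ when $\bar{x} = 0$ and approaching the upper bound $(n+1)^{(n-1)/2}$
when $\bar{x}^2/s^2$ goes to infinity.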


The value of this Bayes factor for Illingworth's data is given by

> ratio=n*mean(shift)^2/((n-1)*var(shift))
> ((1+ratio)/(1+ratio/(n+1)))^(n/2)/sqrt(n+1)
[1] 0.1466004

which confirms the assessment that the model with $\mu = 0$ is to be preferred.
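The behaviour of this closed-form Bayes factor as a function of the ratio $n\bar{x}^2/s^2$ can be explored with a small function working on summary statistics. The name `bf21_onesample` and the numerical values below are hypothetical; in this closed form the Bayes factor starts from $1/\sqrt{n+1}$ and is bounded above by $(n+1)^{(n-1)/2}$.

```r
# closed-form Bayes factor for mu = 0 against mu ~ N(0, sigma^2),
# with s2 the sum of squared deviations and n the sample size
bf21_onesample <- function(xbar, s2, n) {
  ratio <- n * xbar^2 / s2
  ((1 + ratio) / (1 + ratio / (n + 1)))^(n / 2) / sqrt(n + 1)
}

n <- 10
b0    <- bf21_onesample(0, s2 = 4, n = n)                  # 1 / sqrt(n + 1)
vals  <- bf21_onesample(c(0.5, 1, 5, 50), s2 = 4, n = n)   # increasing in xbar^2 / s2
limit <- (n + 1)^((n - 1) / 2)                             # upper bound
c(b0, vals, limit)
```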


2.4  Monte  Carlo  Methods 

While, as seen in Sect. 2.3, the Bayes factor and the posterior probability
are the only quantities used in the assessment of models (and hypotheses),
the analytical derivation of those objects is not always possible, since they
involve integrating the likelihood $\ell(\theta\mid\mathcal{D}_n)$ both on the sets $\Theta_1$ and $\Theta_2$, under
the respective priors $\pi_1$ and $\pi_2$. Fortunately, there exist special numerical
techniques for the computation of Bayes factors, which are, mathematically
speaking, simply ratios of integrals. We now detail the techniques used in the
approximation of intractable integrals, but refer to Chen et al. (2000) and
Robert and Casella (2004, 2009) for book-length presentations.

2.4.1 An Approximation Based on Simulations

The technique that is most commonly used for integral approximations in
statistics is called the Monte Carlo method6 and relies on computer simulations
of random variables to produce an approximation technique that converges
with the number of simulations. Its justification is thus the law of large
numbers, that is, if $x_1, \ldots, x_N$ are independent and distributed from $g$, then
the empirical average

$$\hat{\mathfrak{I}}_N = (h(x_1) + \ldots + h(x_N))/N$$

converges (almost surely) to the integral

$$\mathfrak{I} = \int h(x)\,g(x)\,dx\,.$$


We will not expand on the foundations of the random number generators
in this book, except for an introduction to accept-reject methods in Chap. 5
because of their links with Markov chain Monte Carlo techniques (see, instead,
Robert and Casella, 2004). The connections of utmost relevance here
are (a) that software such as R can produce pseudo-random series that are
indistinguishable from truly random series with a given distribution, as illustrated
in Table 1.1 and (b) that those software packages necessarily cover a limited
collection of distributions. Therefore, other methods must be found for simulating
distributions outside this collection, while relying on the distributions
already available, first and foremost the uniform $\mathcal{U}(0, 1)$ distribution.

The  implementation  of  the  Monte  Carlo  method  is  straightforward,  at  least 
on  a  formal  basis,  with  the  following  algorithmic  representation: 


Algorithm 2.1 Basic Monte Carlo Method

For $i = 1, \ldots, N$,
    simulate $x_i \sim g(x)$.
Take
    $\hat{\mathfrak{I}}_N = (h(x_1) + \ldots + h(x_N))/N$
to approximate $\mathfrak{I}$.


as long as the (computer-generated) pseudo-random generation from $g$ is feasible
and the $h(x_i)$ values are computable. When simulation from $g$ is a problem
because $g$ is nonstandard and usual techniques such as accept-reject algorithms
(see Chap. 5) are difficult to devise, more advanced techniques such as
Markov Chain Monte Carlo (MCMC) are required. We will introduce those
in the next two chapters. When the difficulty is with the intractability of the
function $h$, the solution is often to use an integral representation of $h$ and to
expand the random variables $x_i$ in $(x_i, y_i)$, where $y_i$ is an auxiliary variable.
The use of such representations will be detailed in Chap. 6.

6This method is named in reference to the central district of Monaco, where the
famous Monte-Carlo casino lies.
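Algorithm 2.1 translates almost literally into R. In the sketch below, `basic_mc` is a hypothetical helper taking the function $h$ and a simulator from $g$ as arguments; the test case, $h = \cos$ and $g$ the standard normal density, is chosen because the exact value of the integral, $e^{-1/2}$, is known.

```r
# Basic Monte Carlo approximation of the integral of h(x) g(x),
# where rg(N) returns N simulations from g
basic_mc <- function(h, rg, N = 1e5) {
  x <- rg(N)      # step 1: simulate x_1, ..., x_N from g
  mean(h(x))      # step 2: average the h(x_i)
}

set.seed(2024)
est <- basic_mc(cos, rnorm)   # approximates E[cos(X)] for X ~ N(0, 1)
c(estimate = est, exact = exp(-1 / 2))
```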

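Example 2.2 (Continuation of Example 2.1). As computed in Example 2.1,
the Bayes factor $B^\pi_{21}(\mathcal{D}_n)$ can be simplified into

$$\begin{aligned}
B^\pi_{21}(\mathcal{D}_n) &= \frac{\int \left[(\mu-\xi-\bar{x})^2 + (\mu+\xi-\bar{y})^2 + s^2_{xy}\right]^{-n} e^{-\xi^2/2\tau^2}\, d\mu\, d\xi\big/\tau\sqrt{2\pi}}{\int \left[(\mu-\bar{x})^2 + (\mu-\bar{y})^2 + s^2_{xy}\right]^{-n} d\mu} \\
&= \frac{\int \left[(2\xi+\bar{x}-\bar{y})^2 + 2 s^2_{xy}\right]^{-n+1/2} e^{-\xi^2/2\tau^2}\, d\xi\big/\tau\sqrt{2\pi}}{\left[(\bar{x}-\bar{y})^2 + 2 s^2_{xy}\right]^{-n+1/2}}
\end{aligned}$$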

and we are left with a single integral in the numerator that involves the normal
$\mathcal{N}(0, \tau^2)$ density and can thus be represented as an expectation against this
distribution. This means that simulating a normal $\mathcal{N}(0, \tau^2)$ sample of $\xi_i$'s
($i = 1, \ldots, N$) and replacing $B^\pi_{21}(\mathcal{D}_n)$ with

$$\hat{B}^\pi_{21}(\mathcal{D}_n) = \frac{\frac{1}{N}\sum_{i=1}^{N} \left[(2\xi_i+\bar{x}-\bar{y})^2 + 2 s^2_{xy}\right]^{-n+1/2}}{\left[(\bar{x}-\bar{y})^2 + 2 s^2_{xy}\right]^{-n+1/2}}$$

is an asymptotically valid approximation scheme.

In normaldata, if we compare the fifth and the sixth sessions, both with
$n = 10$ observations, we obtain

> illing=as.matrix(normaldata)
> xsam=illing[illing[,1]==5,2]
> xbar=mean(xsam)
[1] -0.041
> ysam=illing[illing[,1]==6,2]
> ybar=mean(ysam)
[1] -0.025
> Ssquar=9*(var(xsam)+var(ysam))/10
[1] 0.101474

Picking $\tau = 0.75$ as earlier, we get the following approximation to the Bayes
factor

> Nsim=10^4
> tau=0.75
> xis=rnorm(Nsim,sd=tau)
> BaFa=mean(((2*xis+xbar-ybar)^2+2*Ssquar)^(-8.5))/
+ ((xbar-ybar)^2+2*Ssquar)^(-8.5)
[1] 0.0763622

This value of $B^\pi_{21}(\mathcal{D}_n)$ implies that $\xi = 0$, i.e. $\mu_x = \mu_y$ is much more likely
for the data at hand than $\mu_x \ne \mu_y$. Note that, if we use $\tau = 0.1$ instead, the
approximated Bayes factor is 0.4985 which slightly reduces the argument in
favor of $\mu_x = \mu_y$.

Obviously, this Monte Carlo estimate of $\mathfrak{I}$ is not exact, but generating a
sufficiently large number of random variables can render this approximation
error arbitrarily small in a suitable probabilistic sense. It is also possible to
assess the size of this error for a given number of simulations. If

$$\int |h(x)|^2 g(x)\,\mathrm{d}x < \infty\,,$$

the central limit theorem shows that $\sqrt{N}\,[\hat{\mathfrak{I}}_N - \mathfrak{I}]$ is asymptotically
normally distributed, and this can be used to construct asymptotic confidence
regions for $\mathfrak{I}$ around $\hat{\mathfrak{I}}_N$, estimating the asymptotic variance from the
simulation output.
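
As a minimal sketch of this error assessment (it is not part of the normaldata analysis), the following R code approximates $\mathfrak{I} = \mathbb{E}_g[h(X)]$ for the illustrative choice $h(x) = x^2$ and $g$ the $\mathcal{N}(0,1)$ density, for which the exact value is 1, and derives an asymptotic 95% confidence interval from the simulation output. The helper name `mc_ci` and the choice of $h$ are assumptions made for the illustration.

```r
# Monte Carlo estimate of E_g[h(X)] with a CLT-based confidence interval.
# mc_ci is a hypothetical helper name; h is the integrand and rg simulates from g.
mc_ci = function(h, rg, N = 10^4, level = 0.95) {
  hx = h(rg(N))                      # h evaluated at N draws from g
  est = mean(hx)                     # Monte Carlo estimate of the integral
  se = sd(hx) / sqrt(N)              # standard error estimated from the same output
  z = qnorm(1 - (1 - level) / 2)     # normal quantile for the requested level
  c(estimate = est, lower = est - z * se, upper = est + z * se)
}

set.seed(1)
res = mc_ci(function(x) x^2, rnorm, N = 10^5)
print(res)
```

The interval is only meaningful when $h^2 g$ is integrable, as required by the condition above; otherwise the estimated standard error itself is unreliable.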

For the approximation of $B^\pi_{21}(\mathcal{D}_n)$ proposed above, its variability is
illustrated in Fig. 2.4, based on 500 replications of the simulation of $N = 1000$
normal variables used in the approximation and obtained as follows

> xis=matrix(rnorm(500*10^3,sd=tau),nrow=500)
> BF=((2*xis+xbar-ybar)^2+2*Ssquar)^(-8.5)/
+ ((xbar-ybar)^2+2*Ssquar)^(-8.5)
> estims=apply(BF,1,mean)
> hist(estims,nclass=84,prob=T,col="wheat2",
+ main="",xlab="Bayes Factor estimates")
> curve(dnorm(x,mean=mean(estims),sd=sd(estims)),
+ col="steelblue2",add=TRUE)

As  can  be  seen  on  this  figure,  the  value  of  0.076  reported  in  the  previous 
Monte  Carlo  approximation  is  in  the  middle  of  the  range  of  possible  values. 
More  in  connection  with  the  above  point,  the  shape  of  the  histogram  is  clearly 
compatible  with  the  normal  approximation,  as  shown  by  the  fitted  normal 
density. 


2.4.2  Importance  Sampling 

An important feature of Example 2.2 is that, for the Monte Carlo approximation
of $B^\pi_{21}(\mathcal{D}_n)$, we exhibited a normal density within the integral and hence
derived a representation of this integral as an expectation under this normal
distribution. This seems like a very restrictive constraint in the approximation
of integrals but this is only an apparent restriction in that we will now show
that there is no need to simulate directly from the normal density and
furthermore that there is no intrinsic density corresponding to a given integral,
but rather an infinity of densities!

Fig. 2.4. Dataset normaldata: Histogram of 500 realizations of the approximation
$\hat B^\pi_{21}(\mathcal{D}_n)$ based on $N = 1000$ simulations each and normal fit of the sample


Indeed, an arbitrary integral

$$\mathfrak{I} = \int H(x)\,\mathrm{d}x$$

can be represented in infinitely many ways as an expectation, since, for an
arbitrary probability density $\gamma$, we always have

$$\mathfrak{I} = \int \frac{H(x)}{\gamma(x)}\,\gamma(x)\,\mathrm{d}x\,, \qquad (2.8)$$

under the minimal condition that $\gamma(x) > 0$ when $H(x) \neq 0$. Therefore, the
generation of a sample from $\gamma$ can provide a converging approximation to
$\mathfrak{I}$ and the Monte Carlo method applies in a very wide generality. This
method is called importance sampling when applied to an expectation under a
density $g$,

$$\mathfrak{I} = \int h(x) g(x)\,\mathrm{d}x\,, \qquad H(x) = h(x) g(x)\,,$$

since the values $x_i$ simulated from $\gamma$ are weighted by the importance weights
$g(x_i)/\gamma(x_i)$ in the approximation

$$\hat{\mathfrak{I}}_N = \frac{1}{N} \sum_{i=1}^N \frac{g(x_i)}{\gamma(x_i)}\, h(x_i)\,.$$

While the representation (2.8) holds for any density $\gamma$ with a support larger than
the support of $H$, the performance of the empirical average $\hat{\mathfrak{I}}_N$ can deteriorate
considerably when the ratio $h(x)g(x)/\gamma(x)$ is not bounded as this raises the
possibility for infinite variance in the resulting estimator. When using importance
sampling, one must always take heed of a potentially infinite variance of $\hat{\mathfrak{I}}_N$.

An additional incentive in using importance sampling is that this method
does not require the density $g$ (or $\gamma$) to be known completely. Those densities
can be known only up to a normalizing constant, $g(x) \propto \tilde g(x)$ and $\gamma(x) \propto \tilde\gamma(x)$,
since the ratio

$$\sum_{i=1}^n h(x_i)\,\tilde g(x_i)/\tilde\gamma(x_i) \Big/ \sum_{i=1}^n \tilde g(x_i)/\tilde\gamma(x_i)$$

also converges to $\mathfrak{I}$ when $n$ goes to infinity and when the $x_i$'s are generated
from $\gamma$.

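The equivalent of Algorithm 2.1 for importance sampling is as follows:


Algorithm 2.2 Importance Sampling Method

For $i = 1, \ldots, N$,

simulate $x_i \sim \gamma(x)$;
compute $\omega_i = g(x_i)/\gamma(x_i)$.

Take

$$\hat{\mathfrak{I}}_N = \sum_{i=1}^N \omega_i\, h(x_i) \Big/ \sum_{i=1}^N \omega_i$$

to approximate $\mathfrak{I}$.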


This algorithm is straightforward to implement. Since it offers a degree of
freedom in the selection of $\gamma$, simulation from a manageable distribution can
be imposed, keeping in mind the constraint that $\gamma$ should have heavier (flatter)
tails than $g$. Unfortunately, as the dimension of $x$ increases, differences between
the target density $g$ and the importance density $\gamma$ have a larger and larger
impact.
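
A minimal sketch of Algorithm 2.2 in its self-normalized form is given below, under illustrative assumptions that are not taken from the text: the target $g$ is the $\mathcal{N}(0,1)$ density, supplied only up to its normalizing constant, the importance density $\gamma$ is a Student $t$ density with 3 degrees of freedom (hence with heavier tails than $g$), and $h(x) = x^2$, so that the exact value is 1. The function name `importance_sampling` and its arguments are hypothetical.

```r
# Self-normalised importance sampling in the spirit of Algorithm 2.2.
# importance_sampling is a hypothetical helper; gtilde and gammatilde
# may both be known only up to a normalising constant.
importance_sampling = function(h, gtilde, rprop, gammatilde, N = 10^4) {
  x = rprop(N)                       # simulate x_i from the importance density
  w = gtilde(x) / gammatilde(x)      # importance weights, up to a constant
  sum(w * h(x)) / sum(w)             # weighted average with normalised weights
}

set.seed(2)
est = importance_sampling(
  h = function(x) x^2,
  gtilde = function(x) exp(-x^2 / 2),        # unnormalised N(0,1) target
  rprop = function(N) rt(N, df = 3),         # heavier-tailed proposal
  gammatilde = function(x) dt(x, df = 3),
  N = 10^5)
print(est)
```

Swapping the roles of the two densities (a normal proposal for a Student $t$ target) would produce unbounded weights, which is the situation the warning above describes.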

Example 2.3. Consider almost the same setting as in Exercise 2.11: $\mathcal{D}_n =
(x_1, \ldots, x_n)$ is an iid sample from $\mathcal{C}(\theta, 1)$ and the prior on $\theta$ is a flat prior.
We can use a normal importance function from a $\mathcal{N}(\mu, \sigma^2)$ distribution (with
$\sigma = 1$ in the formula and in the code below) to produce a sample
$\theta_1, \ldots, \theta_N$ that approximates the Bayes estimator of $\theta$,
i.e. its posterior mean, by

$$
\hat\delta^\pi(\mathcal{D}_n) =
\frac{\sum_{t=1}^N \theta_t \exp\{(\theta_t-\mu)^2/2\} \prod_{i=1}^n \left[1+(x_i-\theta_t)^2\right]^{-1}}
     {\sum_{t=1}^N \exp\{(\theta_t-\mu)^2/2\} \prod_{i=1}^n \left[1+(x_i-\theta_t)^2\right]^{-1}}\,.
$$

But this is a very poor estimation (see Exercise 2.17 for an analytic explanation)
and it degrades considerably when $\mu$ increases. If we run an R simulation
experiment producing a sample of estimates when $\mu$ increases, as follows,

> Nobs=10
> obs=rcauchy(Nobs)
> Nsim=250
> Nmc=500
> # normal samples: Nsim simulations (rows) for each of Nmc replications (columns)
> sampl=matrix(rnorm(Nsim*Nmc),nrow=Nsim)
> raga=riga=matrix(0,nrow=50,ncol=2) # ranges
> mu=0
> for (j in 1:50){
+ prod=1/dnorm(sampl-mu) # importance sampling
+ for (i in 1:Nobs)
+ prod=prod*dt(obs[i]-sampl,1)
+ esti=apply(sampl*prod,2,sum)/apply(prod,2,sum)
+ raga[j,]=range(esti)
+ riga[j,]=c(quantile(esti,.025),quantile(esti,.975))
+ sampl=sampl+0.1
+ mu=mu+0.1
+ }
> mus=seq(0,4.9,by=0.1)
> plot(mus,0*mus,col="white",xlab=expression(mu),
+ ylab=expression(hat(theta)),ylim=range(raga))
> polygon(c(mus,rev(mus)),c(raga[,1],rev(raga[,2])),col="grey50")
> polygon(c(mus,rev(mus)),c(riga[,1],rev(riga[,2])),col="pink3")

as shown by Fig. 2.5, not only does the range of the approximation increase,
but it ends up missing the true value $\theta = 0$ when $\mu$ is far enough from 0. ◄


2.4.3  Approximation  of  Bayes  Factors 

Bayes factors being ratios of integrals, they can be approximated by regular
importance sampling tools. However, given their specificity as ratios of
marginal likelihoods, hence of normalizing constants of the posterior
distributions, there exist more specialized techniques, including a fairly generic
method called bridge sampling, developed by Gelman and Meng (1998).

Fig. 2.5. Representation of the whole range (grey) and of the 95% range (pink)
of variation of the importance sampling approximation to the Bayes estimate for
$n = 10$ observations from the $\mathcal{C}(0, 1)$ distribution and $N = 250$ simulations of $\theta$
from a $\mathcal{N}(\mu, 1)$ distribution as a function of $\mu$. This range is computed using 500
replications of the importance sampling estimates

When comparing two models with sampling densities $f_1(\mathcal{D}_n|\theta_1)$ (model
$\mathfrak{M}_1$) and $f_2(\mathcal{D}_n|\theta_2)$ (model $\mathfrak{M}_2$), assume that both models share the same
parameter space $\Theta$. This is for instance the case when comparing the fit of two
densities with the same number of parameters (modulo a potential
reparameterization of one of the models). In this setting, if the corresponding prior
densities are $\pi_1(\theta)$ and $\pi_2(\theta)$, we only know the unnormalized posterior
densities $\tilde\pi_1(\theta|\mathcal{D}_n) = f_1(\mathcal{D}_n|\theta)\pi_1(\theta)$ and $\tilde\pi_2(\theta|\mathcal{D}_n) = f_2(\mathcal{D}_n|\theta)\pi_2(\theta)$. In this general
setting, for any positive function $\alpha$ such that the integrals below exist, the
Bayes factor for comparing the two models satisfies

$$
B^\pi_{12}(\mathcal{D}_n) = \frac{m_1(\mathcal{D}_n)}{m_2(\mathcal{D}_n)}
= \frac{m_1(\mathcal{D}_n)}{m_2(\mathcal{D}_n)}\,
  \frac{\int \tilde\pi_1(\theta|\mathcal{D}_n)\,\alpha(\theta)\,\tilde\pi_2(\theta|\mathcal{D}_n)\,\mathrm{d}\theta}
       {\int \tilde\pi_2(\theta|\mathcal{D}_n)\,\alpha(\theta)\,\tilde\pi_1(\theta|\mathcal{D}_n)\,\mathrm{d}\theta}
= \frac{\int \tilde\pi_1(\theta|\mathcal{D}_n)\,\alpha(\theta)\,\pi_2(\theta|\mathcal{D}_n)\,\mathrm{d}\theta}
       {\int \tilde\pi_2(\theta|\mathcal{D}_n)\,\alpha(\theta)\,\pi_1(\theta|\mathcal{D}_n)\,\mathrm{d}\theta}\,.
\qquad (2.9)
$$

Therefore, the bridge sampling approximation

$$
\hat B^\pi_{12}(\mathcal{D}_n) =
\sum_{i=1}^N \tilde\pi_1(\theta_{2i}|\mathcal{D}_n)\,\alpha(\theta_{2i}) \Big/
\sum_{i=1}^N \tilde\pi_2(\theta_{1i}|\mathcal{D}_n)\,\alpha(\theta_{1i})
\qquad (2.10)
$$

is a convergent approximation of the Bayes factor $B^\pi_{12}(\mathcal{D}_n)$ when $\theta_{ji} \sim
\pi_j(\theta|\mathcal{D}_n)$ ($j = 1, 2$, $i = 1, \ldots, N$). One of the appealing features of the method
is that it only requires simulations from the posterior distributions under both
models of interest. Another interesting feature is that $\alpha$ is completely
arbitrary, which means it can be chosen in the best possible way. Using asymptotic
variance arguments, Gelman and Meng (1998) proved that the best choice is

$$\alpha^{\mathrm{o}}(\theta) \propto \frac{1}{\pi_1(\theta|\mathcal{D}_n) + \pi_2(\theta|\mathcal{D}_n)}\,,$$

which bridges both posteriors. This means that the optimal weight of $\theta_{2i}$ in
(2.10) is

$$
\frac{\tilde\pi_1(\theta_{2i}|\mathcal{D}_n)}{\pi_1(\theta_{2i}|\mathcal{D}_n) + \pi_2(\theta_{2i}|\mathcal{D}_n)}
\propto
\frac{\tilde\pi_1(\theta_{2i}|\mathcal{D}_n)}{\tilde\pi_1(\theta_{2i}|\mathcal{D}_n) + B^\pi_{12}(\mathcal{D}_n)\,\tilde\pi_2(\theta_{2i}|\mathcal{D}_n)}\,,
$$

with an appropriate change of indices for the $\theta_{1i}$'s. There is however a caveat
with this finding in that it cannot be attained because the optimum depends on
the very quantity we are trying to approximate! However, the Bayes factor
$B^\pi_{12}(\mathcal{D}_n)$ can first be approximated on a crude basis and the corresponding
construction of $\alpha^{\mathrm{o}}$ iterated till the Bayes factor approximation (2.10)
stabilizes.
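
The iteration can be sketched on a toy problem where the answer is known, which is an assumption of this illustration and not one of the book's datasets: $\tilde\pi_1(\theta) = \exp(-\theta^2/2)$ and $\tilde\pi_2(\theta) = \exp\{-(\theta-1)^2/8\}$ are unnormalized $\mathcal{N}(0,1)$ and $\mathcal{N}(1,4)$ densities, so that the ratio of normalizing constants is $\sqrt{2\pi}/2\sqrt{2\pi} = 1/2$. The names `tilde1`, `tilde2` and `bridge_step` are hypothetical.

```r
# Iterated bridge sampling (2.10) with alpha proportional to
# 1 / (tilde_pi_1 + K * tilde_pi_2), on a toy problem with known answer 1/2.
set.seed(3)
N = 10^5
tilde1 = function(theta) exp(-theta^2 / 2)         # constant sqrt(2*pi)
tilde2 = function(theta) exp(-(theta - 1)^2 / 8)   # constant 2*sqrt(2*pi)
theta1 = rnorm(N, mean = 0, sd = 1)                # sample from pi_1
theta2 = rnorm(N, mean = 1, sd = 2)                # sample from pi_2

# unnormalised densities evaluated once at both samples
p1_at1 = tilde1(theta1); p2_at1 = tilde2(theta1)
p1_at2 = tilde1(theta2); p2_at2 = tilde2(theta2)

# one pass of (2.10) for a current guess K of the Bayes factor
bridge_step = function(K) {
  mean(p1_at2 / (p1_at2 + K * p2_at2)) /
    mean(p2_at1 / (p1_at1 + K * p2_at1))
}

K = 1                                              # crude starting value
for (iter in 1:100) {
  BF = bridge_step(K)
  if (abs(BF - K) < 1e-8 * K) break                # stop once stabilised
  K = BF
}
print(BF)
```

Since the identity (2.9) holds for any $\alpha$, each pass is already a consistent estimate; the iteration only tunes $\alpha$ towards the variance-optimal choice.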

We  will  now  illustrate  this  derivation  in  the  case  of  the  normal  model,  with 
an  application  to  normaldata.  (We  showed  in  Sect.  2.3.3  that  the  Bayes  factor 
was  available  in  closed  form  so  this  implementation  of  the  bridge  sampler  is 
purely  for  illustrative  purposes.)  A  further  implementation  is  discussed  in 
Chap.  4,  Sect.  4.3.2,  in  connection  with  the  probit  model. 

When assessing whether or not $\mu = 0$ is appropriate for the single sample
normal model, the above approximation does not apply directly because there is
an extra parameter in the unconstrained model. There are however two easy
tricks out of this difficulty. The first one, repeatedly found in the literature, is
to add an arbitrary density to make dimensions match. In the normal example,
this means introducing an arbitrary (normalized) density $\pi_1^*(\mu|\sigma^2)$ in the
constrained model (denoted $\mathfrak{M}_1$) and extending the Bayes factor representation
(2.9) to

$$
B^\pi_{12}(\mathcal{D}_n) =
\frac{\int \pi_1^*(\mu|\sigma^2)\,\tilde\pi_1(\sigma^2|\mathcal{D}_n)\,\alpha(\theta)\,\pi_2(\theta|\mathcal{D}_n)\,\mathrm{d}\theta}
     {\int \tilde\pi_2(\theta|\mathcal{D}_n)\,\alpha(\theta)\,\pi_1(\sigma^2|\mathcal{D}_n)\,\mathrm{d}\sigma^2\,\pi_1^*(\mu|\sigma^2)\,\mathrm{d}\mu}\,,
$$

which holds independently of $\pi_1^*(\mu|\sigma^2)$ for the same reason as in (2.9). The
choice of the substitute $\pi_1^*(\mu|\sigma^2)$ equal to an approximation of $\pi_2(\mu|\mathcal{D}_n, \sigma^2)$ is
suggested by Chen et al. (2000). For instance, we can use as $\pi_1^*(\mu|\sigma^2)$ a normal
distribution $\mathcal{N}(\hat\mu, \hat\sigma^2)$ where $\hat\mu$ and $\hat\sigma^2$ are computed based on a simulation
from $\pi_2(\mu, \sigma^2|\mathcal{D}_n)$.

The exact value of this Bayes factor $B^\pi_{12}(\mathcal{D}_n)$ for Illingworth's data is given
by

> ((1+ratio)/(1+ratio/(n+1)))^(-n/2)*sqrt(n+1)
[1] 6.821262

while the bridge sampling solution is obtained as

> n=64
> xbar=mean(shift)
> sqar=(n-1)*var(shift)
> Nmc=10^7
> # Simulation from model M2:
> sigma2=1/rgamma(Nmc,shape=n/2,rate=(n*xbar^2/(n+1)+sqar)/2)
> mu2=rnorm(Nmc,n*xbar/(n+1),sd=sqrt(sigma2/(n+1)))
> # Simulation from model M1:
> sigma1=1/rgamma(Nmc,shape=n/2,rate=(n*xbar^2+sqar)/2)
> muhat=mean(mu2)
> tauat=sd(mu2)
> mu1=rnorm(Nmc,mean=muhat,sd=tauat)
> #tilde functions
> tildepi1=function(sigma,mu){
+ exp(-.5*((n*xbar^2+sqar)/sigma+(n+2)*log(sigma))+
+ dnorm(mu,muhat,tauat,log=T))
+ }
> tildepi2=function(sigma,mu){
+ exp(-.5*((n*(xbar-mu)^2+sqar+mu^2)/sigma+(n+3)*log(sigma)+
+ log(2*pi)))}
> #Bayes Factor loop
> K=diff=1
> rationum=tildepi2(sigma1,mu1)/tildepi1(sigma1,mu1)
> ratioden=tildepi1(sigma2,mu2)/tildepi2(sigma2,mu2)
> while (diff>0.01*K){
+ BF=mean(1/(1+K*rationum))/mean(1/(K+ratioden))
+ diff=abs(K-BF)
+ K=BF}

and returns the value

> BF
[1] 6.820955

which is definitely close to the true value!

The second possible trick to overcome the dimension difficulty while using
the bridge sampling strategy is to introduce artificial posterior distributions
in each of the parameter spaces and to process each marginal likelihood as
an integral ratio in itself. For instance, if $\eta_1(\theta_1)$ is an arbitrary normalized
density on $\theta_1$, and $\alpha$ is an arbitrary function, we have

$$
m_1(\mathcal{D}_n) = \int \tilde\pi_1(\theta_1|\mathcal{D}_n)\,\mathrm{d}\theta_1
= \frac{\int \tilde\pi_1(\theta_1|\mathcal{D}_n)\,\alpha(\theta_1)\,\eta_1(\theta_1)\,\mathrm{d}\theta_1}
       {\int \eta_1(\theta_1)\,\alpha(\theta_1)\,\pi_1(\theta_1|\mathcal{D}_n)\,\mathrm{d}\theta_1}
$$

by application of (2.9). Therefore, the optimal choice of $\alpha$ leads to the
approximation

$$
\hat m_1(\mathcal{D}_n) =
\frac{\sum_{i=1}^N \tilde\pi_1(\theta^\eta_{1i}|\mathcal{D}_n) \big/ \left\{m_1(\mathcal{D}_n)\,\pi_1(\theta^\eta_{1i}|\mathcal{D}_n) + \eta_1(\theta^\eta_{1i})\right\}}
     {\sum_{i=1}^N \eta_1(\theta_{1i}) \big/ \left\{m_1(\mathcal{D}_n)\,\pi_1(\theta_{1i}|\mathcal{D}_n) + \eta_1(\theta_{1i})\right\}}
$$

when $\theta_{1i} \sim \pi_1(\theta_1|\mathcal{D}_n)$ and $\theta^\eta_{1i} \sim \eta_1(\theta_1)$. The choice of the density $\eta_1$ is
obviously fundamental and it should be close to the true posterior $\pi_1(\theta_1|\mathcal{D}_n)$ to
guarantee a good convergence of the approximation. Using a normal approximation to
the posterior distribution of $\theta$ or a non-parametric approximation based on
a sample from $\pi_1(\theta_1|\mathcal{D}_n)$, or yet again an average of MCMC proposals (see
Chap. 4) are reasonable choices.


The R implementation of this approach can be done as follows

> sigma1=1/rgamma(Nmc,shape=n/2,rate=(n*xbar^2+sqar)/2)
> sihat=mean(log(sigma1))
> tahat=sd(log(sigma1))
> sigma1b=exp(rnorm(Nmc,sihat,tahat))
> #tilde function
> tildepi1=function(sigma){
+ exp(-.5*((n*xbar^2+sqar)/sigma+(n+2)*log(sigma)))}
> K=diff=1
> # density of sigma^2 when log(sigma^2) is normal: dnorm(log(sigma))/sigma
> rnum=dnorm(log(sigma1b),sihat,tahat)/
+ (sigma1b*tildepi1(sigma1b))
> rden=sigma1*tildepi1(sigma1)/dnorm(log(sigma1),sihat,tahat)
> while (diff>0.01*K){
+ BF=mean(1/(1+K*rnum))/mean(1/(K+rden))
+ diff=abs(K-BF)
+ K=BF}
> m1=BF

when using a normal distribution on $\log(\sigma^2)$ as an approximation to
$\pi_1(\theta_1|\mathcal{D}_n)$. When considering the unconstrained model, a bivariate normal
density can be used, as in

> sigma2=1/rgamma(Nmc,shape=n/2,rate=(n*xbar^2/(n+1)+sqar)/2)
> mu2=rnorm(Nmc,n*xbar/(n+1),sd=sqrt(sigma2/(n+1)))
> temean=c(mean(mu2),mean(log(sigma2)))
> tevar=cov.wt(cbind(mu2,log(sigma2)))$cov
> te2b=rmnorm(Nmc,mean=temean,tevar) # rmnorm() is provided by the mnormt package
> mu2b=te2b[,1]
> sigma2b=exp(te2b[,2])

leading to

> m1/m2
[1] 6.824417


The performances of both extensions are obviously highly dependent on
the choice of the completion factors, $\eta_1$ and $\eta_2$ on the one hand and $\pi_1^*$ on the
other hand. The performances of the first solution, which bridges both models
via $\pi_1^*$, are bound to deteriorate as the dimension gap between those models
increases. The impact of the dimension of the models is less keenly felt for the
other solution, as the approximation remains local.

As a simple illustration of the performances of both methods, we produce
here a comparison between the completions based on a single
pseudo-conditional and on two local approximations to the posteriors, by running
repeated approximations for normaldata and tracing the resulting boxplot
as a measure of the variability of those methods. As shown in Fig. 2.6, the
variability is quite comparable for both solutions in this specific case.


Note that there exist many other approaches to the approximative
computation of marginal likelihoods and of Bayes factors that we cannot cover
here. We want however to point out the dangers of the harmonic mean
approximation. This approach proceeds from the interesting identity

$$
\mathbb{E}^{\pi_1}\left[\frac{\varphi_1(\theta_1)}{\pi_1(\theta_1)\ell_1(\theta_1|\mathcal{D}_n)}\,\Big|\,\mathcal{D}_n\right]
= \int \frac{\varphi_1(\theta_1)}{\pi_1(\theta_1)\ell_1(\theta_1|\mathcal{D}_n)}\,
       \frac{\pi_1(\theta_1)\ell_1(\theta_1|\mathcal{D}_n)}{m_1(\mathcal{D}_n)}\,\mathrm{d}\theta_1
= \frac{1}{m_1(\mathcal{D}_n)}\,,
$$

which holds no matter what the density $\varphi_1(\theta_1)$ is, provided $\varphi_1(\theta_1) = 0$ when
$\pi_1(\theta_1)\ell_1(\theta_1|\mathcal{D}_n) = 0$. The most common implementation in approximations
of the marginal likelihood uses $\varphi_1(\theta_1) = \pi_1(\theta_1)$, leading to the approximation

$$
\hat m_1(\mathcal{D}_n) = 1 \Big/ \left( N^{-1} \sum_{j=1}^N \frac{1}{\ell_1(\theta_{1j}|\mathcal{D}_n)} \right).
$$

While very tempting, since it allows for a direct processing of simulations
from the posterior distribution, this approximation is unfortunately most often
associated with an infinite variance (Exercise 2.19) and, thus, should not be
used. On the opposite, using $\varphi_1$'s with supports constrained to the 25% HPD
regions (approximated by the convex hull of the 10% or of the 25% highest
simulations) is both completely appropriate and implementable (Marin and
Robert, 2010).

Fig. 2.6. Dataset normaldata: Boxplot of the variability of the approximations
to the Bayes factor assessing whether or not $\mu = 0$, based on a single and on
a double completions. Each approximation is based on $10^5$ simulations and the
boxplots are based on 250 approximations. The dotted line corresponds to the true
value of $B^\pi_{12}(\mathcal{D}_n)$
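
The instability of the harmonic mean approximation can be explored on a toy conjugate model, chosen here purely for illustration and not taken from the text: $x_i \sim \mathcal{N}(\theta, 1)$ with prior $\theta \sim \mathcal{N}(0, 1)$, for which the marginal likelihood is available in closed form. In this model, $1/\ell^2$ behaves like $\exp(n\theta^2)$ while the posterior behaves like $\exp\{-(n+1)\theta^2/2\}$, so the second posterior moment of $1/\ell$ is infinite as soon as $n \ge 2$. The names `loglik` and `harmonic_log_m` are hypothetical, and the data are simulated.

```r
# Harmonic mean approximation of a marginal likelihood (log scale) compared
# with the exact value, in a toy N(theta, 1) model with a N(0, 1) prior.
set.seed(4)
n = 10
x = rnorm(n, mean = 0.5)                 # simulated data

# exact log marginal: x ~ N_n(0, I + 11'), with determinant n + 1
log_m_exact = -n / 2 * log(2 * pi) - log(n + 1) / 2 -
  (sum(x^2) - sum(x)^2 / (n + 1)) / 2

post_mean = sum(x) / (n + 1)             # conjugate posterior N(post_mean, post_sd^2)
post_sd = 1 / sqrt(n + 1)

# log-likelihood, vectorised in theta
loglik = function(theta)
  -n / 2 * log(2 * pi) - (sum(x^2) - 2 * theta * sum(x) + n * theta^2) / 2

# harmonic mean estimate of log m from N exact posterior draws
harmonic_log_m = function(N) {
  theta = rnorm(N, post_mean, post_sd)
  -log(mean(exp(-loglik(theta))))
}

hm = replicate(20, harmonic_log_m(10^3)) # 20 independent replications
print(range(hm))
print(log_m_exact)
```

Comparing the spread of `hm` with `log_m_exact` over repeated runs, and for increasing `N`, gives an empirical feel for how slowly (if at all) the estimates settle.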


2.5  Outlier  Detection 

The  above  description  of  inference  in  normal  models  is  only  an  introduction 
both  to  Bayesian  inference  and  to  normal  structures.  Needless  to  say,  there 
exists  a  much  wider  range  of  possible  applications.  For  instance,  we  will  meet 
the  normal  model  again  in  Chap.  4  as  the  original  case  of  the  (generalized) 
linear  model.  Before  that,  we  conclude  this  chapter  with  a  simple  extension 
of  interest,  the  detection  of  outliers. 

Since normal modeling is often an approximation to the “real thing,” there
may be doubts about its adequacy. As already mentioned above, we will deal
later with the problem of checking that the normal distribution is appropriate
for the whole dataset. Here, we consider the somewhat simpler problem of
separately assessing whether or not each point in the dataset is compatible with
normality. There are many different ways of dealing with this problem. We
choose here to use the predictive distribution: If an observation $x_i$ is unlikely
under the predictive distribution based on the other observations, then we
can argue against its distribution being equal to the distribution of the other
observations.

If $x_{n+1}$ is a future observation from the same distribution $f(\cdot|\theta)$ as the
sample $\mathcal{D}_n$, its predictive distribution given the current sample is defined as

$$
f^\pi(x_{n+1}|\mathcal{D}_n) = \int f(x_{n+1}|\theta, \mathcal{D}_n)\,\pi(\theta|\mathcal{D}_n)\,\mathrm{d}\theta
= \int f(x_{n+1}|\theta)\,\pi(\theta|\mathcal{D}_n)\,\mathrm{d}\theta\,.
$$

This definition is coherent with the Bayesian approach, which considers $x_{n+1}$
as an extra unknown and then integrates out $\theta$ if $x_{n+1}$ is the “parameter” of
interest.
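
The defining integral can be approximated by averaging $f(x_{n+1}|\theta)$ over posterior simulations of $\theta$. The sketch below checks this on a toy model that is an assumption of the illustration (it is simpler than the conjugate setup used next): $x_i \sim \mathcal{N}(\theta, 1)$ with prior $\theta \sim \mathcal{N}(0, 1)$, where the predictive is exactly $\mathcal{N}(\sum_i x_i/(n+1),\, 1 + 1/(n+1))$. The data are simulated and the evaluation point `xnew` is arbitrary.

```r
# Monte Carlo approximation of a predictive density, compared with the
# closed form available in a toy N(theta, 1) model with a N(0, 1) prior.
set.seed(5)
n = 20
x = rnorm(n, mean = 1)                         # simulated sample
post_mean = sum(x) / (n + 1)                   # posterior mean of theta
post_var = 1 / (n + 1)                         # posterior variance of theta

theta = rnorm(10^5, post_mean, sqrt(post_var)) # posterior simulations
xnew = 1.5                                     # point where the predictive is evaluated

pred_mc = mean(dnorm(xnew, mean = theta, sd = 1))          # average of f(xnew | theta)
pred_exact = dnorm(xnew, post_mean, sqrt(1 + post_var))    # closed-form predictive
print(c(monte_carlo = pred_mc, exact = pred_exact))
```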

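For the normal $\mathcal{N}(\mu, \sigma^2)$ setup, using a conjugate prior on $(\mu, \sigma^2)$ of the
form

$$(\sigma^2)^{-\lambda_\sigma-3/2} \exp -\left\{\lambda_\mu(\mu-\xi)^2 + \alpha\right\}/2\sigma^2\,,$$

the corresponding posterior distribution on $(\mu, \sigma^2)$ given $\mathcal{D}_n$ is

$$
\mathcal{N}\left(\frac{\lambda_\mu\xi + n\bar x_n}{\lambda_\mu+n},\ \frac{\sigma^2}{\lambda_\mu+n}\right)
\times
\mathcal{IG}\left(\lambda_\sigma + n/2,\ \left[\alpha + s_x^2 + \frac{n\lambda_\mu}{\lambda_\mu+n}(\bar x_n-\xi)^2\right]\Big/2\right),
$$

denoted by

$$\mathcal{N}\left(\xi(\mathcal{D}_n),\ \sigma^2/\lambda_\mu(\mathcal{D}_n)\right) \times \mathcal{IG}\left(\lambda_\sigma(\mathcal{D}_n)/2,\ \alpha(\mathcal{D}_n)/2\right),$$

and the predictive on $x_{n+1}$ is derived as

$$
\begin{aligned}
f^\pi(x_{n+1}|\mathcal{D}_n)
&\propto \int (\sigma^2)^{-\lambda_\sigma(\mathcal{D}_n)/2-1-1} \exp\left\{-(x_{n+1}-\mu)^2/2\sigma^2\right\} \\
&\qquad \times \exp -\left\{\lambda_\mu(\mathcal{D}_n)(\mu-\xi(\mathcal{D}_n))^2 + \alpha(\mathcal{D}_n)\right\}/2\sigma^2\ \mathrm{d}(\mu,\sigma^2) \\
&\propto \int (\sigma^2)^{-\lambda_\sigma(\mathcal{D}_n)/2-3/2}
   \exp -\left\{\frac{\lambda_\mu(\mathcal{D}_n)}{\lambda_\mu(\mathcal{D}_n)+1}(x_{n+1}-\xi(\mathcal{D}_n))^2 + \alpha(\mathcal{D}_n)\right\}\Big/2\sigma^2\ \mathrm{d}\sigma^2 \\
&\propto \left[\alpha(\mathcal{D}_n) + \frac{\lambda_\mu(\mathcal{D}_n)}{\lambda_\mu(\mathcal{D}_n)+1}(x_{n+1}-\xi(\mathcal{D}_n))^2\right]^{-(\lambda_\sigma(\mathcal{D}_n)+1)/2},
\end{aligned}
$$

where the second line follows from completing the square in $\mu$,
$(x_{n+1}-\mu)^2 + \lambda(\mu-\xi)^2 = (1+\lambda)(\mu-\cdot)^2 + \frac{\lambda}{1+\lambda}(x_{n+1}-\xi)^2$, and integrating $\mu$ out.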
Therefore, the predictive of $x_{n+1}$ given the sample is a Student $t$
distribution with mean $\xi(\mathcal{D}_n)$ and $\lambda_\sigma(\mathcal{D}_n)$ degrees of freedom. In the special case of the
noninformative prior, corresponding to the limiting values $\lambda_\mu = \lambda_\sigma = \alpha = 0$,
the predictive is

$$
f^\pi(x_{n+1}|\mathcal{D}_n) \propto \left[s^2 + \frac{n}{n+1}(x_{n+1}-\bar x_n)^2\right]^{-(n+1)/2}.
\qquad (2.11)
$$

It is therefore a Student's $t$ distribution with $n$ degrees of freedom, a mean
equal to $\bar x_n$ and a scale factor equal to $(n+1)s^2/n$, which is equivalent to
a squared scale parameter equal to $(n+1)s^2/n^2$ (to compare with the maximum
likelihood estimator $\hat\sigma^2_n = s^2/n$).


In the outlier problem, we process each observation $x_i \in \mathcal{D}_n$ as if it
were a “future” observation. Namely, we consider $f_i^\pi(x|\mathcal{D}_n^i)$ as being the
predictive distribution based on $\mathcal{D}_n^i = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$. Considering
$f_i^\pi(x_i|\mathcal{D}_n^i)$ or the corresponding cdf $F_i^\pi(x_i|\mathcal{D}_n^i)$ (in dimension one) gives an
indication of the level of compatibility of the observation $x_i$ with the sample.
To quantify this level, we can, for instance, approximate the distribution of
$F_i^\pi(x_i|\mathcal{D}_n^i)$ as being uniform over $[0,1]$ since $F_i^\pi(\cdot|\mathcal{D}_n^i)$ does converge to the
true cdf of the model. Simultaneously checking all $F_i^\pi(x_i|\mathcal{D}_n^i)$ over $i$ may signal
outliers.


The detection of outliers must pay attention to the Bonferroni fallacy, which
is that extreme values do occur in large enough samples. This means that, as $n$
increases, we will see smaller and smaller values of $F_i^\pi(x_i|\mathcal{D}_n^i)$ occurring, and
this even when the whole sample is from the same distribution. The significance
level must therefore be chosen in accordance with this observation, for instance
using a bound $a$ on $F_i^\pi(x_i|\mathcal{D}_n^i)$ such that

$$(1-a)^n = 1-\alpha\,, \quad\text{that is,}\quad a = 1-(1-\alpha)^{1/n}\,,$$

where $\alpha$ is the nominal level chosen for outlier detection.
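
As a small sketch of this calibration, assuming the condition $(1-a)^n = 1-\alpha$ and treating the $n$ predictive cdf values as approximately independent uniforms, the bound $a$ can be computed and applied as follows. The function name `outlier_bound` is hypothetical, and the vector `u` is a simulated stand-in for the predictive cdf values of a sample without outliers.

```r
# Per-observation bound a such that the probability of at least one of n
# independent uniforms falling below a equals the nominal level alpha.
outlier_bound = function(n, alpha = 0.05) 1 - (1 - alpha)^(1 / n)

n = 64
a = outlier_bound(n)              # bound for n = 64 and alpha = 0.05
print(c(bound = a, bonferroni = 0.05 / n))

set.seed(6)
u = runif(n)                      # stand-in for the predictive cdf values
flagged = which(u < a)            # indices that would be signalled as outliers
print(flagged)
```

The bound is always at least as large as the simpler Bonferroni value $\alpha/n$, to which it is very close for small $\alpha$.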


Considering normaldata, we can compute the predictive cdf for each of
the 64 observations, considering the 63 remaining ones as data.

> n=length(shift)
> outl=rep(0,n)
> for (i in 1:n){
+ # predictive from the n-1 other points: t with n-1 degrees of freedom,
+ # location mean(shift[-i]), squared scale n*s2/(n-1)^2, s2=(n-2)*var(shift[-i])
+ lomean=mean(shift[-i])
+ losd=sd(shift[-i])*sqrt(n*(n-2))/(n-1)
+ outl[i]=pt((shift[i]-lomean)/losd,df=n-1)
+ }


Figure 2.7 provides the qq-plot of the $F_i^\pi(x_i|\mathcal{D}_n^i)$'s against the uniform
quantiles and compares it with a qq-plot based on a dataset truly simulated from
the uniform $\mathcal{U}(0,1)$.

> plot(c(0,1),c(0,1),lwd=2,ylab="Predictive",xlab="Uniform",
+ type="l")
> points((1:n)/(n+1),sort(outl),pch=19,col="steelblue3")
> points((1:n)/(n+1),sort(runif(n)),pch=19,col="tomato")

There is no clear departure from uniformity when looking at this graph, except
of course for the multiple values found in normaldata.

Fig. 2.7. Dataset normaldata: qq-plot of the sample of the $F_i^\pi(x_i|\mathcal{D}_n^i)$ for a uniform
$\mathcal{U}(0,1)$ distribution (blue dots) and comparison with a qq-plot for a uniform $\mathcal{U}(0,1)$
sample (red dots)


2.6  Exercises 


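2.1 Show that, if

$$\mu|\sigma^2 \sim \mathcal{N}(\xi, \sigma^2/\lambda_\mu)\,, \qquad \sigma^2 \sim \mathcal{IG}(\lambda_\sigma/2, \alpha/2)\,,$$

then

$$\mu \sim \mathcal{T}(\lambda_\sigma, \xi, \alpha/\lambda_\mu\lambda_\sigma)\,,$$

a $t$ distribution with $\lambda_\sigma$ degrees of freedom, location parameter $\xi$ and scale parameter
$\alpha/\lambda_\mu\lambda_\sigma$.

2.2 Show that, if $\sigma^2 \sim \mathcal{IG}(\alpha, \beta)$, then $\mathbb{E}[\sigma^2] = \beta/(\alpha-1)$. Derive from the density
of $\mathcal{IG}(\alpha, \beta)$ that the mode is located in $\beta/(\alpha+1)$.

2.3 Show that minimizing (in $\hat\theta(\mathcal{D}_n)$) the posterior expectation $\mathbb{E}^\pi[\|\theta-\hat\theta\|^2\,|\,\mathcal{D}_n]$
produces the posterior expectation as the solution in $\hat\theta$.

2.4 Show that the Fisher information on $\theta = (\mu, \sigma^2)$ for the normal $\mathcal{N}(\mu, \sigma^2)$
distribution is given by

$$
I^F(\theta) = \mathbb{E}_\theta
\begin{pmatrix}
1/\sigma^2 & 2(x-\mu)/2\sigma^4 \\
2(x-\mu)/2\sigma^4 & (\mu-x)^2/\sigma^6 - 1/2\sigma^4
\end{pmatrix}
=
\begin{pmatrix}
1/\sigma^2 & 0 \\
0 & 1/2\sigma^4
\end{pmatrix}
$$

and deduce that Jeffreys' prior is $\pi^J(\theta) \propto 1/\sigma^3$.

2.5 Derive each line of Table 2.1 by an application of Bayes' formula, $\pi(\theta|x) \propto
\pi(\theta)f(x|\theta)$, and the identification of the standard distributions.

2.6 A Weibull distribution $\mathcal{W}(\alpha, \beta, \gamma)$ is defined as the power transform of a gamma
$\mathcal{G}(\alpha, \beta)$ distribution: If $x \sim \mathcal{W}(\alpha, \beta, \gamma)$, then $x^\gamma \sim \mathcal{G}(\alpha, \beta)$. Show that, when $\gamma$ is
known, $\mathcal{W}(\alpha, \beta, \gamma)$ allows for a conjugate family, but that it is not an exponential
family when $\gamma$ is unknown.

2.7 Show that, when the prior on $\theta = (\mu, \sigma^2)$ is $\mathcal{N}(\xi, \sigma^2/\lambda_\mu) \times \mathcal{IG}(\lambda_\sigma, \alpha)$, the
marginal prior on $\mu$ is a Student $t$ distribution $\mathcal{T}(2\lambda_\sigma, \xi, \alpha/\lambda_\mu\lambda_\sigma)$ (see Example 2.18 for
the definition of a Student $t$ density). Give the corresponding marginal prior on $\sigma^2$. For
an iid sample $\mathcal{D}_n = (x_1, \ldots, x_n)$ from $\mathcal{N}(\mu, \sigma^2)$, derive the parameters of the posterior
distribution of $(\mu, \sigma^2)$.

2.8 Show that the normalizing constant for a Student $\mathcal{T}(\nu, \mu, \sigma^2)$ distribution is

$$\frac{\Gamma((\nu+1)/2)/\Gamma(\nu/2)}{\sigma\sqrt{\nu\pi}}\,.$$

Deduce that the density of the Student $t$ distribution $\mathcal{T}(\nu, \mu, \sigma^2)$ is

$$
f_\nu(x) = \frac{\Gamma((\nu+1)/2)}{\sigma\sqrt{\nu\pi}\,\Gamma(\nu/2)}
\left(1 + \frac{(x-\mu)^2}{\nu\sigma^2}\right)^{-(\nu+1)/2}.
$$

2.9 Show that, for location and scale models, the specific noninformative priors are
special cases of Jeffreys' generic prior, i.e., that $\pi^J(\theta) = 1$ and $\pi^J(\theta) = 1/\theta$, respectively.

2.10 Show that, when $\pi(\theta)$ is a probability density, (2.5) necessarily holds for all
datasets $\mathcal{D}_n$.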


2.11 Consider a dataset $\mathcal{D}_n$ from the Cauchy distribution, $\mathcal{C}(\mu, 1)$.

1. Show that the likelihood function is

$$\ell(\mu|\mathcal{D}_n) = \prod_{i=1}^n f_\mu(x_i) = \frac{1}{\pi^n \prod_{i=1}^n (1+(x_i-\mu)^2)}\,.$$

2. Examine whether or not there is a conjugate prior for this problem. (The answer is
no.)

3. Introducing a normal prior on $\mu$, say $\mathcal{N}(0, 10)$, show that the posterior distribution
is proportional to

$$\tilde\pi(\mu|\mathcal{D}_n) = \frac{\exp(-\mu^2/20)}{\prod_{i=1}^n (1+(x_i-\mu)^2)}\,.$$

4. Propose a numerical solution for solving $\tilde\pi(\mu|\mathcal{D}_n) = k$. (Hint: A simple trapezoidal
integration can be used: based on a discretization size $\Delta$, computing $\tilde\pi(\mu|\mathcal{D}_n)$ on a
regular grid of width $\Delta$ and summing up.)

2.12 Show that the limit of the posterior probability $P^\pi(\mu < 0|x)$ of (2.7) when $\tau$
goes to $\infty$ is $\Phi(-x/\sigma)$. Show that, when $\xi$ varies in $\mathbb{R}$, the posterior probability can
take any value between 0 and 1.

2.13 Define a function BaRaJ of the ratio rat when z=mean(shift)/.75 in the
function BaFa. Deduce from a plot of the function BaRaJ that the Bayes factor is always less
than one when rat varies. (Note: It is possible to establish analytically that the Bayes
factor is maximal and equal to 1 for $\tau = 0$.)

2.14 In the application part of Example 2.1 to normaldata, plot the approximated
Bayes factor as a function of $\tau$. (Hint: Simulate a single normal $\mathcal{N}(0, 1)$ sample and
recycle it for all values of $\tau$.)

2.15 In the setup of Example 2.1, show that, when $\xi \sim \mathcal{N}(0, \sigma^2)$, the Bayes factor
can be expressed in closed form using the normalizing constant of the $t$ distribution (see
Exercise 2.8).

2.16 Discuss what happens to the importance sampling approximation when the
support of $g$ is larger than the support of $\gamma$.

2.17 Show that, when $\gamma$ is the normal $\mathcal{N}(0, \nu/(\nu-2))$ density and $f_\nu$ is the density
of the $t$ distribution with $\nu$ degrees of freedom, the ratio

$$\frac{f_\nu^2(x)}{\gamma(x)} \propto \frac{e^{x^2(\nu-2)/2\nu}}{[1+x^2/\nu]^{(\nu+1)}}$$

does not have a finite integral. What does this imply about the variance of the importance
weights?

Deduce that the importance weights of Example 2.3 have infinite variance.


2.18 If $f_\nu$ denotes the density of the Student $t$ distribution $\mathcal{T}(\nu, 0, 1)$ (see
Exercise 2.8), consider the integral

$$\mathfrak{I} = \int \sqrt{\left|\frac{x}{1-x}\right|}\, f_\nu(x)\,\mathrm{d}x\,.$$

1. Show that $\mathfrak{I}$ is finite but that

$$\int \frac{|x|}{|1-x|}\, f_\nu(x)\,\mathrm{d}x = \infty\,.$$

2. Discuss the respective merits of the following importance functions $\gamma$

- the density of the Student $\mathcal{T}(\nu, 0, 1)$ distribution,

- the density of the Cauchy $\mathcal{C}(0, 1)$ distribution,

- the density of the normal $\mathcal{N}(0, \nu/(\nu-2))$ distribution.

In particular, show via an R simulation experiment that these different choices all
lead to unreliable estimates of $\mathfrak{I}$ and deduce that the three corresponding estimators
have infinite variance.

3. Discuss the alternative choice of a gamma distribution folded at 1, that is, the
distribution of $x$ symmetric around 1 and such that

$$|x-1| \sim \mathcal{G}a(\alpha, 1)\,.$$

Show that

$$h(x)\,\frac{f_\nu^2(x)}{\gamma(x)} \propto \sqrt{x}\, f_\nu^2(x)\, |1-x|^{1-\alpha-1} \exp|1-x|$$

is integrable around $x = 1$ when $\alpha < 1$ but not at infinity. Run a simulation
experiment to evaluate the performances of this new proposal.

2.19 Evaluate the harmonic mean approximation

$$\hat m_1(\mathcal{D}_n) = 1 \Big/ \left( N^{-1} \sum_{j=1}^N \frac{1}{\ell_1(\theta_{1j}|\mathcal{D}_n)} \right)$$

when applied to the $\mathcal{N}(0, \sigma^2)$ model, normaldata, and an $\mathcal{IG}(1, 1)$ prior on $\sigma^2$.
You see, I always keep my sums.

— Ian  Rankin,  Strip  Jack. — 


Roadmap 

Linear regression is one of the most widely used tools in statistics for analyzing the
(linear) influence of some variables or some factors on others and thus to uncover
explanatory and predictive patterns. This chapter details the Bayesian analysis
of the linear (or regression) model both in terms of prior specification (Zellner's
G-prior) and in terms of variable selection, the next chapter appearing as a sequel
for nonlinear dependence structures. The reader should be warned that, given
that these models are the only conditional models where explicit computation
can be conducted, this chapter contains a fair amount of matrix calculus. The
photograph at the top of this page is a picture of processionary caterpillars, in
connection (for once!) with the benchmark dataset used in this chapter.


J.-M. Marin and C.P. Robert, Bayesian Essentials with R, Springer Texts
in Statistics, DOI 10.1007/978-1-4614-8687-9_3,

© Springer Science+Business Media New York 2014

3 Regression and Variable Selection


3.1  Linear  Models 

A large proportion of statistical analyses deal with the representation of
dependences among several observed quantities. For instance, which social
factors influence unemployment duration and the probability of finding a
new job? Which economic indicators are best related to recession occurrences?
Which physiological levels are most strongly correlated with aneurysm
strokes? From a statistical point of view, the ultimate goal of these analyses is
thus to find a proper representation of the conditional distribution, $f(y|\theta, x)$,
of an observable variable $y$ given a vector of observables $x$, based on a sample
of $x$ and $y$. While the overall estimation of the conditional density $f$ is usually
beyond our ability, the estimation of $\theta$ and possibly of restricted features of $f$
is possible within the Bayesian framework, as shown in this chapter.

The variable of primary interest, $y$, is called the response or the outcome
variable; we assume here that this variable is continuous, but we
will completely relax this assumption in the next chapter. The variables
$x = (x_1, \ldots, x_p)$ are called explanatory variables and may be discrete,
continuous, or both. One sometimes picks a single variable $x_j$ to be of primary
interest. We then call it the treatment variable, labeling the other components
of $x$ as control variables, meaning that we want to address the (linear)
influence of $x_j$ on $y$ once the linear influence of all the other variables has
been taken into account (as in medical studies). The distribution of $y$ given
$x$ is typically studied in the context of a set of units or experimental subjects,
$i = 1, \ldots, n$, such as patients in a hospital ward, on which both $y_i$ and
$x_{i1}, \ldots, x_{ip}$ are measured. The dataset is then made up of the union of the
vector of outcomes

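$$\mathbf{y} = (y_1, \ldots, y_n)$$

and the $n \times p$ matrix of explanatory variables

$$\mathbf{X} = [\mathbf{x}_1 \ \cdots \ \mathbf{x}_p] =
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
x_{31} & x_{32} & \cdots & x_{3p} \\
\vdots & \vdots & & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{bmatrix}.$$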


The caterpillar dataset exploited in this chapter was extracted from a
1973 study on pine processionary caterpillars:¹ it assesses the influence of some
forest settlement characteristics on the development of caterpillar colonies.
This dataset was first published and studied in Tomassone et al. (1993). The
response variable is the logarithmic transform of the average number of nests
of caterpillars per tree (as the one in the picture at the beginning of this
chapter) in an area of 500 m² (which corresponds to the last column in
caterpillar). There are $p = 8$ potential explanatory variables defined on $n = 33$
areas, as follows:

$x_1$ is the altitude (in meters),
$x_2$ is the slope (in degrees),
$x_3$ is the number of pine trees in the area,
$x_4$ is the height (in meters) of the tree sampled at the center of the area,
$x_5$ is the orientation of the area (from 1 if southbound to 2 otherwise),
$x_6$ is the height (in meters) of the dominant tree,
$x_7$ is the number of vegetation strata,
$x_8$ is the mix settlement index (from 1 if not mixed to 2 if mixed).

¹ These caterpillars derive their name from their habit of moving over the ground
in incredibly long head-to-tail monk-like processions when leaving their nest to create
a new colony.

The goal of the regression analysis is to decide which explanatory variables
have a strong influence on the number of nests and how these influences overlap
with one another. As shown by Fig. 3.1, some of these variables clearly
have a restricting influence on the number of nests, as for instance with $x_5$, $x_7$
and $x_8$. We use the following R code to produce Fig. 3.1 (the way we created
the objects y and X will be described later).

> par(mfrow=c(2,4),mar=c(4.2,2,2,1.2))
> for (j in 1:8) plot(X[,j],y,xlab=vnames[j],pch=19,
+ col="sienna4",xaxt="n",yaxt="n")


While many models and thus many dependence structures can be proposed
for dependent datasets like caterpillar, in this chapter we only focus on the
Gaussian linear regression model, namely the case when $\mathbb{E}[y|x, \theta]$ is linear in
$x$ and the noise is normal.

The ordinary normal linear regression model is such that, using a matrix
representation,

$$\mathbf{y} \mid \alpha, \beta, \sigma^2 \sim \mathcal{N}_n\left(\alpha \mathbf{1}_n + \mathbf{X}\beta, \sigma^2 \mathbf{I}_n\right),$$

where $\mathcal{N}_n$ denotes the normal distribution in dimension $n$, and thus the $y_i$'s
are independent normal random variables with

$$\mathbb{E}[y_i|\alpha, \beta, \sigma^2] = \alpha + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}, \qquad \mathbb{V}[y_i|\alpha, \beta, \sigma^2] = \sigma^2 .$$

Given that the models studied in this chapter are all conditional on the
regressors, we omit the conditioning on $\mathbf{X}$ to simplify the notations.


For caterpillar, where $n = 33$ and $p = 8$, we thus assume that the expected
lognumber $y_i$ of caterpillar nests per tree over an area is modeled as a
linear combination of an intercept and eight predictor variables ($i = 1, \ldots, n$),

$$\mathbb{E}[y_i|\alpha, \beta, \sigma^2] = \alpha + \sum_{j=1}^{8} \beta_j x_{ij},$$

while the variation around this expectation is supposed to be normally
distributed. Note that it is also customary to assume that the $y_i$'s are
conditionally independent.

Fig. 3.1. Dataset caterpillar: Plot of the pairs $(x_j, y)$ ($1 \le j \le 8$)

The caterpillar dataset is called by the command data(caterpillar)
and is made of the following rows:

1200 22 1 4 1.1 5.9 1.4 1.4 2.37
1342 28 8 4.4 1.5 6.4 1.7 1.7 1.47
...
1229 21 11 5.8 1.8 10 2.3 2 0.21
1310 36 17 5.2 1.9 10.3 2.6 2 0.03

The first eight columns correspond to the explanatory variables and the last
column is the response variable, i.e. the lognumber of caterpillar nests. The
following R code is an example for starting with this caterpillar dataset:

> y=log(caterpillar$y)
> X=as.matrix(caterpillar[,1:8])

There is a difference between using finite-valued regressors like $x_j$ in
caterpillar and using categorical variables (or factors), which also take a finite
number of values but whose range has no numerical meaning. For instance, if
$x$ denotes the socio-professional category of an employee, this variable may
range from 1 to 9 for a rough grid of socio-professional activities, or it may
range from 1 to 89 on a finer grid, and the numerical values are not comparable.
It thus makes little sense to involve $x$ directly in the regression, and the
usual approach is to replace the single regressor $x$ (taking values in $\{1, \ldots, m\}$,
say) with $m$ indicator (or dummy) variables $x_1 = \mathbb{I}_1(x), \ldots, x_m = \mathbb{I}_m(x)$. In
essence, a different constant (or intercept) $\beta_j$ is used in the regression for each
class of categorical variable: it is invoked in the linear regression under the
form

$$\cdots + \beta_1 \mathbb{I}_1(x) + \ldots + \beta_m \mathbb{I}_m(x) + \cdots .$$

Note that there is an identifiability issue related with this model since the sum
of the indicators is always equal to one. In a Bayesian perspective, identifiability
can be achieved via the prior distribution. However, we can also impose
an identifiability constraint on the parameters, for instance by omitting one
class (such as $\beta_1 = 0$). We pursue this direction further in Sects. 4.5.1 and 6.2.
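The identifiability issue can be checked numerically. The following is a minimal sketch on a made-up factor with $m = 3$ classes; the object names (x, D, full, reduced) and the values are illustrative assumptions and are not part of caterpillar.

```r
# Hypothetical categorical regressor with m = 3 classes (illustrative values)
x <- factor(c(1, 2, 3, 1, 2, 3, 1, 2))

# One indicator column per class: I_1(x), ..., I_m(x)
D <- model.matrix(~ x - 1)

# The indicators sum to one on every row, so [1_n D] is rank deficient
full      <- cbind(1, D)
rank_full <- qr(full)$rank        # smaller than ncol(full)

# Omitting the first class (beta_1 = 0) restores full column rank
reduced      <- model.matrix(~ x) # intercept + indicators of classes 2 and 3
rank_reduced <- qr(reduced)$rank
```

Here rank_full falls one short of the number of columns of full, while reduced has full column rank, which is the constraint obtained by dropping one class.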


3.2  Classical  Least  Squares  Estimator 


Before fully launching into the description of the Bayesian approach to the
linear model, we recall the basics of the classical processing of this model
(in particular, to relate the Bayesian perspective to the results provided by
standard software such as R lm output). For instance, the parameter $\beta$ can
obviously be estimated via maximum likelihood estimation. In order to avoid
non-identifiability and uniqueness problems, we assume that $[\mathbf{1}_n\ \mathbf{X}]$ is of full
rank, that is, rank $[\mathbf{1}_n\ \mathbf{X}] = p + 1$. This also means that there is no redundant
structure among the explanatory variables.² We suppose in addition that
$p + 1 < n$ in order to obtain well-defined estimates for all parameters. Notice that,
since the inferential process is conditioned on the design matrix $\mathbf{X}$, we choose
to standardize the data, namely to center and to scale the columns of $\mathbf{X}$ so
that the estimated values of $\beta$ are truly comparable. For this purpose, we use
the R function scale:

> X=scale(X)


² Hence, the exclusion of one of the classes for categorical variables.

The likelihood $\ell(\alpha, \beta, \sigma^2|\mathbf{y})$ of the standard normal linear model is
provided by the following matrix representation:

$$\frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{-\frac{1}{2\sigma^2}(\mathbf{y} - \alpha\mathbf{1}_n - \mathbf{X}\beta)^T(\mathbf{y} - \alpha\mathbf{1}_n - \mathbf{X}\beta)\right\}. \tag{3.1}$$


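The maximum likelihood estimators of $\alpha$ and $\beta$ are then the solution of the
(least squares) minimization problem

$$\min_{\alpha,\beta}\ (\mathbf{y} - \alpha\mathbf{1}_n - \mathbf{X}\beta)^T(\mathbf{y} - \alpha\mathbf{1}_n - \mathbf{X}\beta)
= \min_{\alpha,\beta}\ \sum_{i=1}^{n} (y_i - \alpha - \beta_1 x_{i1} - \ldots - \beta_p x_{ip})^2 .$$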


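If we denote by $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ the empirical mean of the $y_i$'s and recall that
$\mathbf{1}_n^T\mathbf{X} = \mathbf{0}_p^T$ because of the standardization step, we have a Pythagorean
decomposition of the above norm as

$$\begin{aligned}
(\mathbf{y} - \alpha\mathbf{1}_n - \mathbf{X}\beta)^T(\mathbf{y} - \alpha\mathbf{1}_n - \mathbf{X}\beta)
&= (\mathbf{y} - \bar{y}\mathbf{1}_n - \mathbf{X}\beta + (\bar{y} - \alpha)\mathbf{1}_n)^T(\mathbf{y} - \bar{y}\mathbf{1}_n - \mathbf{X}\beta + (\bar{y} - \alpha)\mathbf{1}_n) \\
&= (\mathbf{y} - \bar{y}\mathbf{1}_n - \mathbf{X}\beta)^T(\mathbf{y} - \bar{y}\mathbf{1}_n - \mathbf{X}\beta) + 2(\bar{y} - \alpha)\mathbf{1}_n^T(\mathbf{y} - \bar{y}\mathbf{1}_n - \mathbf{X}\beta) + n(\bar{y} - \alpha)^2 \\
&= (\mathbf{y} - \bar{y}\mathbf{1}_n - \mathbf{X}\beta)^T(\mathbf{y} - \bar{y}\mathbf{1}_n - \mathbf{X}\beta) + n(\bar{y} - \alpha)^2 .
\end{aligned}$$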


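Indeed, $\mathbf{1}_n^T(\mathbf{y} - \bar{y}\mathbf{1}_n - \mathbf{X}\beta) = (n\bar{y} - n\bar{y}) = 0$. Therefore, the likelihood
$\ell(\alpha, \beta, \sigma^2|\mathbf{y})$ is given by

$$\frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{-\frac{1}{2\sigma^2}(\mathbf{y} - \bar{y}\mathbf{1}_n - \mathbf{X}\beta)^T(\mathbf{y} - \bar{y}\mathbf{1}_n - \mathbf{X}\beta)\right\} \exp\left\{-\frac{n}{2\sigma^2}(\bar{y} - \alpha)^2\right\}.$$

We get from the above decomposition that

$$\hat\alpha = \bar{y}, \qquad \hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{y} - \bar{y}) .$$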

In geometrical terms, $(\hat\alpha, \hat\beta)$ is the orthogonal projection of $\mathbf{y}$ on the linear
subspace spanned by the columns of $[\mathbf{1}_n\ \mathbf{X}]$. It is quite simple to check that
$(\hat\alpha, \hat\beta)$ is an unbiased estimator of $(\alpha, \beta)$. In fact, the Gauss-Markov theorem
(see, e.g., Christensen, 2002) states that $(\hat\alpha, \hat\beta)$ is the best linear unbiased
estimator of $(\alpha, \beta)$. This means that, for all $a \in \mathbb{R}^{p+1}$, and with the abuse of
notation that, here, $(\hat\alpha, \hat\beta)$ represents a column vector,

$$\mathbb{V}(a^T(\hat\alpha, \hat\beta)|\alpha, \beta, \sigma^2) \le \mathbb{V}(a^T(\tilde\alpha, \tilde\beta)|\alpha, \beta, \sigma^2)$$

for any unbiased linear estimator $(\tilde\alpha, \tilde\beta)$ of $(\alpha, \beta)$. (Note that the property of
unbiasedness is not particularly appealing when considered on its own.)

An unbiased estimator of $\sigma^2$ is

$$\hat\sigma^2 = \frac{1}{n - p - 1}(\mathbf{y} - \hat\alpha\mathbf{1}_n - \mathbf{X}\hat\beta)^T(\mathbf{y} - \hat\alpha\mathbf{1}_n - \mathbf{X}\hat\beta) = \frac{s^2}{n - p - 1},$$

and $\hat\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$ approximates the covariance matrix of $\hat\beta$. Note that the MLE
of $\sigma^2$ is not $\hat\sigma^2$ but $\tilde\sigma^2 = s^2/n$.
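As a quick numerical check of these closed forms, the sketch below computes $\hat\alpha$, $\hat\beta$, $s^2$ and the two variance estimates on simulated data and compares them with the output of lm. The simulated design, the coefficient values and all object names are assumptions made for illustration only; they do not come from caterpillar.

```r
set.seed(1)
n <- 50; p <- 3
X <- scale(matrix(rnorm(n * p), n, p))       # centred and scaled columns
y <- as.vector(1 + X %*% c(2, 0, -1) + rnorm(n))

# Closed-form estimates obtained from the Pythagorean decomposition
alpha_hat  <- mean(y)
beta_hat   <- as.vector(solve(t(X) %*% X, t(X) %*% (y - mean(y))))
res        <- y - alpha_hat - as.vector(X %*% beta_hat)
s2         <- sum(res^2)                     # residual sum of squares
sigma2_hat <- s2 / (n - p - 1)               # unbiased estimate
sigma2_mle <- s2 / n                         # maximum likelihood estimate

# Comparison with the standard R fit
fit       <- lm(y ~ X)
coef_gap  <- max(abs(unname(coef(fit)) - c(alpha_hat, beta_hat)))
sigma_gap <- abs(summary(fit)$sigma^2 - sigma2_hat)
```

Both gaps are at the level of floating-point error, since lm solves the same least squares problem and reports the unbiased variance estimate.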

The standard t-statistics are defined as ($j = 1, \ldots, p$)

$$T_j = \frac{\hat\beta_j - \beta_j}{\sqrt{\hat\sigma^2\omega_{jj}}} \sim \mathcal{T}(n - p - 1, 0, 1),$$

where $\omega_{jj}$ denotes the $(j, j)$-th element of the matrix $(\mathbf{X}^T\mathbf{X})^{-1}$. These
t-statistics are used in classical tests, for instance for testing $H_0: \beta_j = 0$ versus
$H_1: \beta_j \neq 0$, the former being accepted at level $\gamma$ if

$$|\hat\beta_j| \Big/ \sqrt{\hat\sigma^2\omega_{jj}} < F^{-1}_{n-p-1}(1 - \gamma/2),$$

the $(1 - \gamma/2)$th quantile of the Student’s t $\mathcal{T}(n - p - 1, 0, 1)$ distribution
(with location parameter 0 and scale parameter 1). The frequentist argument
in using this bound (see Casella and Berger, 2001) is that $H_0$ is rejected exactly
when the so-called p-value is smaller than $\gamma$,

$$p_j = P_{H_0}(|T_j| > |t_j|) < \gamma .$$


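Note that these statistics $T_j$ can also be used when constructing marginal
frequentist confidence intervals on the $\beta_j$'s like

$$\left\{\beta_j;\ \frac{|\beta_j - \hat\beta_j|}{\sqrt{\hat\sigma^2\omega_{jj}}} \le F^{-1}_{n-p-1}(1 - \gamma/2)\right\} = \left\{\beta_j;\ |\beta_j - \hat\beta_j| \le \hat\sigma\sqrt{\omega_{jj}}\, F^{-1}_{n-p-1}(1 - \gamma/2)\right\}.$$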


From a Bayesian perspective, we are far from advocating the use of p-values, in
Bayesian settings or elsewhere, since they suffer from many defects (exposed for
instance in Robert, 2007, Chap. 5), one being that they are often wrongly
interpreted as probabilities of the null hypotheses.
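For readers who want to see where the columns of the lm summary come from, here is a minimal sketch that rebuilds the t-statistics, two-sided p-values and $(1 - \gamma)$ confidence intervals from the formulas above. It uses simulated data; the design, coefficients and object names are illustrative assumptions.

```r
set.seed(2)
n <- 40; p <- 2
X <- scale(matrix(rnorm(n * p), n, p))
y <- as.vector(0.5 + X %*% c(1, 0) + rnorm(n))

beta_hat   <- as.vector(solve(t(X) %*% X, t(X) %*% y))
res        <- y - mean(y) - as.vector(X %*% beta_hat)
sigma2_hat <- sum(res^2) / (n - p - 1)
omega      <- diag(solve(t(X) %*% X))        # omega_jj

# t-statistics under H0: beta_j = 0 and their two-sided p-values
t_stat <- beta_hat / sqrt(sigma2_hat * omega)
p_val  <- 2 * pt(abs(t_stat), df = n - p - 1, lower.tail = FALSE)

# Marginal (1 - gam) confidence intervals
gam  <- 0.05
half <- qt(1 - gam / 2, df = n - p - 1) * sqrt(sigma2_hat * omega)
ci   <- cbind(lower = beta_hat - half, upper = beta_hat + half)

# Reference values from the standard R output (intercept row dropped)
fit <- lm(y ~ X)
ref <- summary(fit)$coefficients[-1, ]
```

The vectors t_stat and p_val reproduce the "t value" and "Pr(>|t|)" columns of ref, and ci agrees with confint(fit) for the slope coefficients.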


For caterpillar, the unbiased estimate of $\sigma^2$ is equal to 0.7781 and the
maximum likelihood estimates of $\alpha$ and of the components $\beta_j$ produced by
the R command

> summary(lm(y~X))

are given in Fig. 3.2, along with the least squares estimates of their respective
standard deviations and p-values. According to the classical paradigm, the
coefficients $\beta_1$, $\beta_2$ and $\beta_7$ are the only ones considered to be significant.

We stress here that conditioning on $\mathbf{X}$ is valid only when $\mathbf{X}$ is exogenous,
that is, only when we can write the joint distribution of $(\mathbf{y}, \mathbf{X})$ as

$$f(\mathbf{y}, \mathbf{X}|\alpha, \beta, \sigma^2, \delta) = f(\mathbf{y}|\alpha, \beta, \sigma^2, \mathbf{X})\, f(\mathbf{X}|\delta),$$

where $(\alpha, \beta, \sigma^2)$ and $\delta$ are fixed parameters. We can thus ignore $f(\mathbf{X}|\delta)$ if the
parameter $\delta$ is only a nuisance parameter since this part is independent³ of
$(\alpha, \beta, \sigma^2)$. The practical advantage of using a regression model as above is that
it is much easier to specify a realistic conditional distribution of one variable
given $p$ others rather than a joint distribution on all $p + 1$ variables. Note
that if $\mathbf{X}$ is not exogenous, for instance when $\mathbf{X}$ involves past values of $\mathbf{y}$ (see
Chap. 7), the joint distribution must be used instead.


lm(formula = y ~ X)

Residuals:
    Min      1Q  Median      3Q     Max
-1.4710 -0.4474 -0.1769  0.6121  1.5602

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.81328    0.15356  -5.296 1.97e-05 ***
Xx1         -0.52722    0.21186  -2.489   0.0202 *
Xx2         -0.39286    0.16974  -2.315   0.0295 *
Xx3          0.65133    0.38670   1.684   0.1051
Xx4         -0.29048    0.31551  -0.921   0.3664
Xx5         -0.21645    0.16865  -1.283   0.2116
Xx6          0.29361    0.53562   0.548   0.5886
Xx7         -1.09027    0.47020  -2.319   0.0292 *
Xx8         -0.02312    0.17225  -0.134   0.8944

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8821 on 24 degrees of freedom
Multiple R-squared: 0.6234, Adjusted R-squared: 0.4979

Fig. 3.2. Dataset caterpillar: R output providing the least squares estimates of
the regression coefficients along with their standard significance analysis

³ From a Bayesian point of view, note that we would also need to impose prior
independence between $(\alpha, \beta, \sigma^2)$ and $\delta$ to achieve this separation.


3.3  The  Jeffreys  Prior  Analysis 

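Considering only the case of a complete lack of prior information on the
parameters of the linear model, we first describe a noninformative solution based
on the Jeffreys prior. It is rather easy to show that the Jeffreys prior in this
case is

$$\pi^J(\alpha, \beta, \sigma^2) \propto \sigma^{-2},$$

which is equivalent to a flat prior on $(\alpha, \beta, \log\sigma^2)$. We recall that

$$\begin{aligned}
\ell(\alpha, \beta, \sigma^2|\mathbf{y})
&= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{-\frac{1}{2\sigma^2}(\mathbf{y} - \bar{y}\mathbf{1}_n - \mathbf{X}\beta)^T(\mathbf{y} - \bar{y}\mathbf{1}_n - \mathbf{X}\beta)\right\} \times \exp\left\{-\frac{n}{2\sigma^2}(\bar{y} - \alpha)^2\right\} \\
&= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{-\frac{1}{2\sigma^2}(\mathbf{y} - \hat\alpha\mathbf{1}_n - \mathbf{X}\hat\beta)^T(\mathbf{y} - \hat\alpha\mathbf{1}_n - \mathbf{X}\hat\beta)\right\} \\
&\qquad \times \exp\left\{-\frac{n}{2\sigma^2}(\hat\alpha - \alpha)^2 - \frac{1}{2\sigma^2}(\hat\beta - \beta)^T\mathbf{X}^T\mathbf{X}(\hat\beta - \beta)\right\}.
\end{aligned}$$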


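The corresponding posterior distribution is therefore

$$\begin{aligned}
\pi^J(\alpha, \beta, \sigma^2|\mathbf{y})
&\propto (\sigma^{-2})^{n/2} \exp\left\{-\frac{1}{2\sigma^2}(\mathbf{y} - \hat\alpha\mathbf{1}_n - \mathbf{X}\hat\beta)^T(\mathbf{y} - \hat\alpha\mathbf{1}_n - \mathbf{X}\hat\beta)\right\} \\
&\qquad \times \sigma^{-2} \exp\left\{-\frac{n}{2\sigma^2}(\hat\alpha - \alpha)^2 - \frac{1}{2\sigma^2}(\hat\beta - \beta)^T\mathbf{X}^T\mathbf{X}(\hat\beta - \beta)\right\} \\
&\propto (\sigma^{-2})^{p/2} \exp\left\{-\frac{1}{2\sigma^2}(\hat\beta - \beta)^T\mathbf{X}^T\mathbf{X}(\hat\beta - \beta)\right\} \\
&\qquad \times (\sigma^{-2})^{1/2} \exp\left\{-\frac{n}{2\sigma^2}(\hat\alpha - \alpha)^2\right\} \times (\sigma^{-2})^{(n-p-1)/2+1} \exp\left\{-\frac{s^2}{2\sigma^2}\right\}.
\end{aligned}$$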

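From this expression, we deduce the following (conditional and marginal)
posterior distributions

$$\begin{aligned}
\alpha \mid \sigma^2, \mathbf{y} &\sim \mathcal{N}\left(\hat\alpha, \sigma^2/n\right), \\
\beta \mid \sigma^2, \mathbf{y} &\sim \mathcal{N}_p\left(\hat\beta, \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\right), \\
\sigma^2 \mid \mathbf{y} &\sim \mathcal{IG}\left((n - p - 1)/2, s^2/2\right).
\end{aligned}$$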


As in every analysis involving an improper prior, one needs to check that the
corresponding posterior distribution is proper. In this case, $\pi^J(\alpha, \beta, \sigma^2|\mathbf{y})$ is
proper when both $n > p + 1$ and rank $[\mathbf{1}_n\ \mathbf{X}] = p + 1$. The former constraint
requires that there be at least as many data points as there are parameters in
the model, and, as already explained above, the latter is obviously necessary
for identifiability reasons.

The corresponding Bayesian estimates of $\alpha$, $\beta$ and $\sigma^2$ are thus given by

$$\mathbb{E}^\pi[\alpha|\mathbf{y}] = \hat\alpha, \qquad \mathbb{E}^\pi[\beta|\mathbf{y}] = \hat\beta \qquad\text{and}\qquad \mathbb{E}^\pi[\sigma^2|\mathbf{y}] = \frac{s^2}{n - p - 3},$$

respectively. Unsurprisingly, the Jeffreys prior estimate of $\alpha$ is the empirical
mean. Further, the posterior expectation of $\beta$ is the maximum likelihood
estimate. Note also that the Jeffreys prior estimate of $\sigma^2$ is larger (and thus
more pessimistic) than both the maximum likelihood estimate $s^2/n$ and the
classical unbiased estimate $s^2/(n - p - 1)$.
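These estimates can be checked by simulating from the Jeffreys posterior, using the conditional and marginal distributions listed above. The sketch below is only an illustration on simulated data; the design, the coefficients, the number of draws N and the object names are assumptions, and the inverse gamma draw is obtained as the inverse of a gamma draw.

```r
set.seed(3)
n <- 40; p <- 2
X <- scale(matrix(rnorm(n * p), n, p))
y <- as.vector(0.5 + X %*% c(1, -0.5) + rnorm(n))

XtX_inv  <- solve(t(X) %*% X)
beta_hat <- as.vector(XtX_inv %*% t(X) %*% y)
s2       <- sum((y - mean(y) - as.vector(X %*% beta_hat))^2)

N <- 1e5
# sigma^2 | y ~ IG((n-p-1)/2, s2/2): inverse of a Gamma(shape, rate) draw
sigma2 <- 1 / rgamma(N, shape = (n - p - 1) / 2, rate = s2 / 2)
# alpha | sigma^2, y ~ N(ybar, sigma^2 / n)
alpha  <- rnorm(N, mean(y), sqrt(sigma2 / n))
# beta | sigma^2, y ~ N_p(beta_hat, sigma^2 (X'X)^{-1}), via a Cholesky factor
L    <- t(chol(XtX_inv))
beta <- beta_hat + sweep(L %*% matrix(rnorm(p * N), p, N), 2, sqrt(sigma2), "*")

mc_sigma2    <- mean(sigma2)          # Monte Carlo posterior mean of sigma^2
exact_sigma2 <- s2 / (n - p - 3)      # closed-form posterior mean
```

Up to Monte Carlo error, mc_sigma2 matches exact_sigma2, the row means of beta match beta_hat, and mean(alpha) matches the empirical mean of y.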

The marginal posterior distribution of $\beta_j$ associated with the above joint
distribution is

$$\mathcal{T}\left(n - p - 1, \hat\beta_j, \omega_{jj}s^2/(n - p - 1)\right)$$

(recall that $\omega_{jj} = (\mathbf{X}^T\mathbf{X})^{-1}_{(j,j)}$). Hence, the similarity with a frequentist
analysis of this model is very strong since the classical $(1 - \gamma)$ confidence interval
and the Bayesian HPD interval on $\beta_j$ coincide, even though they have different
interpretations. They are both equal to

$$\left\{\beta_j;\ |\beta_j - \hat\beta_j| \le F^{-1}_{n-p-1}(1 - \gamma/2)\sqrt{\omega_{jj}s^2/(n - p - 1)}\right\}.$$

For caterpillar, the Bayes estimate of $\sigma^2$ is equal to 0.8489. Figure 3.3
provides the corresponding (marginal) 95% HPD intervals for each component
of $\beta$. (It is obtained by the plotCI function, part of the gplots package.)
Note that while some of these credible intervals include the value $\beta_j = 0$
(represented by the dashed line), they do not necessarily validate acceptance of
the null hypothesis $H_0: \beta_j = 0$, which must be tested through a Bayes factor,
as described below. This distinction is a major difference from the classical
approach, where confidence intervals are dual sets of acceptance regions.


3.4  Zellner’s  G-Prior  Analysis 

From this section onwards,⁴ we concentrate on a different noninformative
approach which was proposed by Arnold Zellner⁵ to handle linear regression
from a Bayesian perspective. This approach is a middle-ground perspective
where some prior information may be available on $\beta$ and it is called Zellner’s
G-prior, the “G” being the symbol used by Zellner in the prior variance.

⁴ In order to keep this coverage of G-priors simple and self-contained, we made
several choices in the presentation that the most mature readers will possibly find
arbitrary, but this cannot be avoided if we want to keep the chapter at a reasonable
length.

⁵ Arnold Zellner was a famous Bayesian econometrician, who wrote two reference
books on Bayesian econometrics (Zellner, 1971, 1984).

Fig. 3.3. Dataset caterpillar: Range of the credible 95% HPD intervals for $\alpha$ (top
row) and each component of $\beta$ when using the Jeffreys prior


3.4.1  A  Semi-noninformative  Solution 

When considering the likelihood (3.1), its shape is both Gaussian and Inverse
Gamma: indeed, $\beta$ given $\sigma^2$ appears in a Gaussian-like expression, while
$\sigma^2$ involves an Inverse Gamma expression. This structure leads to a natural
conjugate prior family, of the form

$$(\alpha, \beta)|\sigma^2 \sim \mathcal{N}_{p+1}\left((\tilde\alpha, \tilde\beta), \sigma^2 M^{-1}\right),$$

conditional on $\sigma^2$, where $M$ is a $(p + 1, p + 1)$ positive definite symmetric
matrix, and for $\sigma^2$,

$$\sigma^2 \sim \mathcal{IG}(a, b), \qquad a, b > 0 .$$

(The conjugacy can be easily checked by the reader.) Even in the presence
of genuine information on the parameters, the hyperparameters $M$, $a$ and $b$
are very difficult to specify. Moreover, the posterior distributions, notably the
posterior variances, are sensitive to the specification of these hyperparameters.

Therefore, given that a natural conjugate prior for the linear regression
model has severe limitations, a more elaborate strategy is called for. The
idea at the core of Zellner’s G-prior modeling is to allow the experimenter
to introduce (possibly weak) information about the location parameter of the
regression but to bypass the most difficult aspects of the prior specification,
namely the derivation of the prior correlation structure. This structure is fixed
in Zellner’s proposal since the prior corresponds to

$$\beta|\alpha, \sigma^2 \sim \mathcal{N}_p\left(\tilde\beta, g\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\right), \tag{3.2}$$

and a noninformative prior distribution is imposed on the pair $(\alpha, \sigma^2)$,

$$\pi(\alpha, \sigma^2) \propto \sigma^{-2} . \tag{3.3}$$

Zellner’s G-prior is thus decomposed as a (conditional) Gaussian prior for $\beta$
and an improper (Jeffreys) prior for $(\alpha, \sigma^2)$. This modelling somehow appears
as a data-dependent prior through its dependence on $\mathbf{X}$, but this is not a
genuine issue⁶ since the whole model is conditional on $\mathbf{X}$. The experimenter
thus restricts prior determination to the choices of $\tilde\beta$ and of the constant $g$.
As we will see once the posterior distribution is constructed, the factor $g$ can
be interpreted as being inversely proportional to the amount of information
available in the prior relative to the sample. For instance, setting $g = n$ gives
the prior the same weight as one observation of the sample. We will use this
as our default value.

Genuine data-dependent priors are not acceptable in a Bayesian analysis because
they use the data twice and fail to enjoy the basic convergence properties of
the Bayes estimators. (See Carlin and Louis, 1996, for a comparative study of
the corresponding so-called empirical Bayes estimators.)

Note that, in the initial proposition of Zellner (1984), the parameter $\alpha$ is
not modelled by a flat prior distribution. It was instead considered to be a
component of the vector $\beta$. (This was also the approach adopted in Marin and
Robert 2007.) However, endowing $\alpha$ with a flat prior ensures the location-scale
invariance of the analysis, which means that changes in location or scale on
$\mathbf{y}$ (like a switch from Celsius to Fahrenheit degrees for temperatures) do not
impact on the resulting inference.

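⁶ This choice is more problematic when conditioning on $\mathbf{X}$ is no longer possible,
as for instance when $\mathbf{X}$ contains lagged dependent variables (Chap. 7) or endogenous
variables.

We are now engaging into some algebra that will expose the properties of
the G-posterior. First, we assume $p > 0$, meaning that there is at least one
explanatory variable in the model. We define the matrix $P = \mathbf{X}\{\mathbf{X}^T\mathbf{X}\}^{-1}\mathbf{X}^T$.
The prior $\pi(\alpha, \beta, \sigma^2)$ can then be decomposed as

$$\pi(\alpha, \beta, \sigma^2) \propto (\sigma^2)^{-p/2} \exp\left[-\frac{1}{2g\sigma^2}\left\{\beta^T\mathbf{X}^T\mathbf{X}\beta - 2\beta^T\mathbf{X}^TP\mathbf{X}\tilde\beta\right\}\right] \times \sigma^{-2} \exp\left[-\frac{1}{2g\sigma^2}\tilde\beta^T\mathbf{X}^TP\mathbf{X}\tilde\beta\right]$$

since $\mathbf{X}^TP\mathbf{X} = \mathbf{X}^T\mathbf{X}$. Therefore,

$$\begin{aligned}
\pi(\alpha, \beta, \sigma^2|\mathbf{y}) \propto (\sigma^2)^{-n/2-p/2-1}
&\exp\left[-\frac{1}{2\sigma^2}(\mathbf{y} - \bar{y}\mathbf{1}_n - \mathbf{X}\beta)^T(\mathbf{y} - \bar{y}\mathbf{1}_n - \mathbf{X}\beta)\right] \times \exp\left[-\frac{n}{2\sigma^2}(\bar{y} - \alpha)^2\right] \\
&\times \exp\left[-\frac{1}{2g\sigma^2}\tilde\beta^T\mathbf{X}^TP\mathbf{X}\tilde\beta\right] \times \exp\left[-\frac{1}{2g\sigma^2}\left\{\beta^T\mathbf{X}^T\mathbf{X}\beta - 2\beta^T\mathbf{X}^TP\mathbf{X}\tilde\beta\right\}\right].
\end{aligned}$$

Since $\mathbf{1}_n^T\mathbf{X} = \mathbf{0}_p^T$, we deduce that

$$\begin{aligned}
\pi(\alpha, \beta, \sigma^2|\mathbf{y}) \propto (\sigma^2)^{-n/2-p/2-1}
&\exp\left[-\frac{1}{2\sigma^2}\left\{\beta^T\mathbf{X}^T\mathbf{X}\beta - 2\mathbf{y}^T\mathbf{X}\beta\right\}\right] \times \exp\left[-\frac{1}{2\sigma^2}(\mathbf{y} - \bar{y}\mathbf{1}_n)^T(\mathbf{y} - \bar{y}\mathbf{1}_n)\right] \\
&\times \exp\left[-\frac{n}{2\sigma^2}(\bar{y} - \alpha)^2\right] \times \exp\left[-\frac{1}{2g\sigma^2}\tilde\beta^T\mathbf{X}^TP\mathbf{X}\tilde\beta\right] \\
&\times \exp\left[-\frac{1}{2g\sigma^2}\left\{\beta^T\mathbf{X}^T\mathbf{X}\beta - 2\beta^T\mathbf{X}^TP\mathbf{X}\tilde\beta\right\}\right].
\end{aligned}$$

Since $P\mathbf{X} = \mathbf{X}$, we deduce that, conditionally on $\mathbf{y}$, $\mathbf{X}$ and $\sigma^2$, the parameters
$\alpha$ and $\beta$ are independent and such that

$$\begin{aligned}
\alpha \mid \sigma^2, \mathbf{y} &\sim \mathcal{N}_1\left(\bar{y}, \sigma^2/n\right), \\
\beta \mid \mathbf{y}, \sigma^2 &\sim \mathcal{N}_p\left(\frac{g}{g+1}\left(\hat\beta + \tilde\beta/g\right), \frac{\sigma^2 g}{g+1}\{\mathbf{X}^T\mathbf{X}\}^{-1}\right),
\end{aligned}$$

where $\hat\beta = \{\mathbf{X}^T\mathbf{X}\}^{-1}\mathbf{X}^T\mathbf{y}$ is the maximum likelihood (and least squares)
estimator of $\beta$. The posterior independence between $\alpha$ and $\beta$ is due to the
fact that $\mathbf{X}$ is centered and that $\alpha$ and $\beta$ are a priori independent.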

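Moreover, the posterior distribution of $\sigma^2$ is given by

$$\sigma^2 \mid \mathbf{y} \sim \mathcal{IG}\left((n - 1)/2,\ \left\{s^2 + (\tilde\beta - \hat\beta)^T\mathbf{X}^T\mathbf{X}(\tilde\beta - \hat\beta)/(g + 1)\right\}/2\right),$$

where $\mathcal{IG}(a, b)$ is an inverse Gamma distribution with mean $b/(a - 1)$ and
where $s^2 = (\mathbf{y} - \bar{y}\mathbf{1}_n - \mathbf{X}\hat\beta)^T(\mathbf{y} - \bar{y}\mathbf{1}_n - \mathbf{X}\hat\beta)$ corresponds to the (classical)
residual sum of squares.

The previous derivation assumes that $p > 0$. In the special case $p = 0$, which
will later be used as a null model in hypothesis testing, similar arguments lead
to

$$\begin{aligned}
\alpha \mid \mathbf{y}, \sigma^2 &\sim \mathcal{N}\left(\bar{y}, \sigma^2/n\right), \\
\sigma^2 \mid \mathbf{y} &\sim \mathcal{IG}\left((n - 1)/2,\ (\mathbf{y} - \bar{y}\mathbf{1}_n)^T(\mathbf{y} - \bar{y}\mathbf{1}_n)/2\right).
\end{aligned}$$

(There is no $\beta$ when $p = 0$, as this corresponds to the constant mean model.)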

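Recalling the double expectation formulas

$$\mathbb{E}[\mathbb{E}[X|Y]] = \mathbb{E}[X] \qquad\text{and}\qquad \mathbb{V}(X) = \mathbb{V}[\mathbb{E}(X|Y)] + \mathbb{E}[\mathbb{V}(X|Y)]$$

for $\mathbb{V}(X|Y) = \mathbb{E}[(X - \mathbb{E}(X|Y))^2|Y]$, we can derive from the previous derivations that

$$\mathbb{E}^\pi[\alpha|\mathbf{y}] = \mathbb{E}^\pi\left[\mathbb{E}^\pi(\alpha|\sigma^2, \mathbf{y})\,|\,\mathbf{y}\right] = \mathbb{E}^\pi[\bar{y}|\mathbf{y}] = \bar{y}$$

and that

$$\mathbb{V}^\pi(\alpha|\mathbf{y}) = \mathbb{V}(\bar{y}|\mathbf{y}) + \mathbb{E}\left[\frac{\sigma^2}{n}\,\Big|\,\mathbf{y}\right] = \frac{\kappa}{n(n - 3)},$$

where

$$\begin{aligned}
\kappa &= (\mathbf{y} - \bar{y}\mathbf{1}_n)^T(\mathbf{y} - \bar{y}\mathbf{1}_n) + \frac{1}{g + 1}\left\{-g\,\mathbf{y}^TP\mathbf{y} + \tilde\beta^T\mathbf{X}^TP\mathbf{X}\tilde\beta - 2\mathbf{y}^TP\mathbf{X}\tilde\beta\right\} \\
&= s^2 + (\tilde\beta - \hat\beta)^T\mathbf{X}^T\mathbf{X}(\tilde\beta - \hat\beta)/(g + 1) .
\end{aligned}$$

With a bit of extra algebra, we can recover the whole distribution of $\alpha$ from

$$\pi(\alpha, \sigma^2|\mathbf{y}) \propto (\sigma^2)^{-(n-1)/2-1-1/2} \exp\left\{-\frac{1}{2\sigma^2}\left[n(\alpha - \bar{y})^2 + \kappa\right]\right\},$$

namely

$$\pi(\alpha|\mathbf{y}) \propto \left[1 + \frac{n(\alpha - \bar{y})^2}{\kappa}\right]^{-n/2} .$$

This means that the marginal posterior distribution of $\alpha$ (the distribution
of $\alpha$ given only $\mathbf{y}$ and $\mathbf{X}$) is a Student’s t distribution with $n - 1$ degrees
of freedom, a location parameter equal to $\bar{y}$ and a scale parameter equal to
$\kappa/n(n - 1)$.

If we now turn to the parameter $\beta$, by the same double expectation
formula, we derive that

$$\mathbb{E}^\pi[\beta|\mathbf{y}] = \mathbb{E}^\pi\left[\mathbb{E}^\pi(\beta|\sigma^2, \mathbf{y})\,|\,\mathbf{y}\right] = \mathbb{E}^\pi\left[\frac{g}{g+1}\left(\hat\beta + \tilde\beta/g\right)\Big|\,\mathbf{y}\right] = \frac{g}{g+1}\left(\hat\beta + \tilde\beta/g\right).$$

This result gives its meaning to the above point relating $g$ with the amount
of information contained in the dataset. For instance, when $g = 1$, the prior
information has the same weight as this amount. In this case, the Bayesian
estimate of $\beta$ is the average between the least squares estimator and the prior
expectation. The larger $g$ is, the weaker the prior information and the closer
the Bayesian estimator is to the least squares estimator. For instance, when
$g$ goes to infinity, the posterior mean converges to $\hat\beta$.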
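The role of $g$ as a shrinkage weight is easy to visualize numerically. The following sketch evaluates the posterior mean $g(\hat\beta + \tilde\beta/g)/(g + 1)$ for a few values of $g$ on simulated data; the data, the prior location beta_tilde (set to zero here) and the helper name post_mean are assumptions made for illustration.

```r
set.seed(4)
n <- 30; p <- 2
X <- scale(matrix(rnorm(n * p), n, p))
y <- as.vector(X %*% c(1, -1) + rnorm(n))

beta_hat   <- as.vector(solve(t(X) %*% X, t(X) %*% y))
beta_tilde <- rep(0, p)              # assumed prior location

# Posterior mean of beta under the G-prior, as a function of g
post_mean <- function(g) g / (g + 1) * (beta_hat + beta_tilde / g)

pm_1   <- post_mean(1)    # halfway between beta_tilde and beta_hat
pm_n   <- post_mean(n)    # default g = n: prior weighs as one observation
pm_big <- post_mean(1e8)  # nearly the least squares estimate
```

With beta_tilde equal to zero, the posterior mean is simply beta_hat multiplied by $g/(g + 1)$, so pm_n shrinks the least squares estimate by the factor $n/(n + 1)$.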

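Based on similar derivations, we can compute the posterior variance of $\beta$.
Indeed,

$$\begin{aligned}
\mathbb{V}^\pi(\beta|\mathbf{y}) &= \mathbb{V}\left[\frac{g}{g+1}\left(\hat\beta + \tilde\beta/g\right)\Big|\,\mathbf{y}\right] + \mathbb{E}\left[\frac{g\sigma^2}{g+1}(\mathbf{X}^T\mathbf{X})^{-1}\Big|\,\mathbf{y}\right] \\
&= \frac{\kappa g}{(g + 1)(n - 3)}(\mathbf{X}^T\mathbf{X})^{-1} .
\end{aligned}$$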

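Once more, it is possible to integrate out $\sigma^2$ in

$$\begin{aligned}
\pi(\beta, \sigma^2|\mathbf{y}) \propto\ &(\sigma^2)^{-p/2} \exp\left\{-\frac{g + 1}{2g\sigma^2}\left\{\beta - \mathbb{E}^\pi[\beta|\mathbf{y}]\right\}^T\mathbf{X}^T\mathbf{X}\left\{\beta - \mathbb{E}^\pi[\beta|\mathbf{y}]\right\}\right\} \\
&\times (\sigma^2)^{-(n-1)/2-1} \exp\left\{-\frac{\kappa}{2\sigma^2}\right\},
\end{aligned}$$

leading to

$$\pi(\beta|\mathbf{y}) \propto \left[1 + \frac{g + 1}{g\kappa}\left\{\beta - \mathbb{E}^\pi[\beta|\mathbf{y}]\right\}^T\mathbf{X}^T\mathbf{X}\left\{\beta - \mathbb{E}^\pi[\beta|\mathbf{y}]\right\}\right]^{-(n+p-1)/2} .$$

Therefore, the marginal posterior distribution of $\beta$ is also a multivariate
Student’s t distribution with $n - 1$ degrees of freedom, location parameter equal
to $\frac{g}{g+1}(\hat\beta + \tilde\beta/g)$ and scale parameter equal to

$$\frac{g\kappa}{(g + 1)(n - 1)}(\mathbf{X}^T\mathbf{X})^{-1} .$$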
The standard Bayes estimator of $\sigma^2$ for this model is the posterior
expectation

$$\mathbb{E}^\pi[\sigma^2|\mathbf{y}] = \frac{\kappa}{n - 3} = \frac{s^2 + (\tilde\beta - \hat\beta)^T\mathbf{X}^T\mathbf{X}(\tilde\beta - \hat\beta)/(g + 1)}{n - 3} .$$


In the special case $p = 0$, by using similar arguments, we get

$$\mathbb{E}^\pi[\sigma^2|\mathbf{y}] = \frac{(\mathbf{y} - \bar{y}\mathbf{1}_n)^T(\mathbf{y} - \bar{y}\mathbf{1}_n)}{n - 3} = \frac{s^2}{n - 3},$$

which is the same expectation as with the Jeffreys prior.


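HPD regions on subvectors of the parameter $\beta$ can be derived in a
straightforward manner from this marginal posterior distribution of $\beta$. For a single
parameter, we have for instance

$$\beta_j \mid \mathbf{y} \sim \mathcal{T}\left(n - 1,\ \frac{g}{g+1}\left(\frac{\tilde\beta_j}{g} + \hat\beta_j\right),\ \frac{g\kappa}{(n - 1)(g + 1)}\omega_{jj}\right),$$

where $\omega_{jj}$ is the $(j, j)$-th element of the matrix $(\mathbf{X}^T\mathbf{X})^{-1}$. If we set

$$\zeta = (\tilde\beta + g\hat\beta)/(g + 1),$$

the transform

$$(\beta_j - \zeta_j) \Big/ \sqrt{\frac{g\kappa}{(n - 1)(g + 1)}\omega_{jj}}$$

is (marginally) distributed as a standard t distribution with $n - 1$ degrees of
freedom. A $(1 - \gamma)$ HPD interval on $\beta_j$ has therefore

$$\zeta_j \pm \sqrt{\frac{g\kappa}{(n - 1)(g + 1)}\omega_{jj}}\ F^{-1}_{n-1}(1 - \gamma/2)$$

as bounds, where $F^{-1}_{n-1}$ denotes the quantile function of the $\mathcal{T}(n - 1, 0, 1)$
distribution.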
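A minimal sketch of this HPD computation follows, again on simulated data. The choice $g = n$ matches the default mentioned earlier; the data, the zero prior location beta_tilde and the object names (kappa, zeta, tau, hpd) are assumptions made for illustration.

```r
set.seed(5)
n <- 30; p <- 2
X <- scale(matrix(rnorm(n * p), n, p))
y <- as.vector(X %*% c(1, -1) + rnorm(n))

g          <- n                      # default weight used in the text
beta_tilde <- rep(0, p)              # assumed prior location
XtX        <- t(X) %*% X
beta_hat   <- as.vector(solve(XtX, t(X) %*% y))
s2         <- sum((y - mean(y) - as.vector(X %*% beta_hat))^2)
d          <- beta_tilde - beta_hat
kappa      <- as.numeric(s2 + t(d) %*% XtX %*% d / (g + 1))
omega      <- diag(solve(XtX))       # omega_jj

# Location and scale of the marginal Student's t posterior of each beta_j
zeta <- (beta_tilde + g * beta_hat) / (g + 1)
tau  <- sqrt(g * kappa * omega / ((n - 1) * (g + 1)))

# (1 - gam) HPD intervals
gam <- 0.05
hpd <- cbind(lower = zeta - tau * qt(1 - gam / 2, df = n - 1),
             upper = zeta + tau * qt(1 - gam / 2, df = n - 1))
```

Because the marginal posterior is symmetric and unimodal, the HPD interval coincides with the equal-tailed interval, which is why a single quantile of the t distribution suffices.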


3.4.2  The  BayesReg  R  Function 

We have created in bayess an R function called BayesReg to implement
Zellner’s G-prior analysis within R. The purpose is dual: first, this R function
shows how easily automated this approach can be. Second, it also illustrates
how it is possible to get exactly the same type of output as the standard R
function summary(lm(y~X)).

The following R code is extracted from this function BayesReg and used
to calculate the Bayes estimates. As an aside, notice that we use the function
stop in order to end the calculations if the matrix $\mathbf{X}^T\mathbf{X}$ is not invertible.

if (det(t(X)%*%X)<=1e-7)
  stop("Design matrix has too low a rank!",call.=FALSE)

We  also  stress  the  use  of  scale  below  to  standardize  the  explanatory  variables. 

# g and betatilde (the prior hyperparameters) are assumed to be defined
X=as.matrix(X)
n=length(y)
p=dim(X)[2]
X=scale(X)
U=solve(t(X)%*%X)%*%t(X)

# MLE
alphaml=mean(y)
betaml=U%*%y
s2=t(y-alphaml-X%*%betaml)%*%(y-alphaml-X%*%betaml)
kappa=as.numeric(s2+t(betatilde-betaml)%*%t(X)%*%X%*%
(betatilde-betaml)/(g+1))

# Bayes estimates (posterior means and variances)
malphabayes=alphaml
mbetabayes=g/(g+1)*(betaml+betatilde/g)
msigma2bayes=kappa/(n-3)
valphabayes=kappa/(n*(n-3))
vbetabayes=diag(kappa*g/((g+1)*(n-3))*solve(t(X)%*%X))
vsigma2bayes=2*kappa^2/((n-3)*(n-4))
postmean=c(malphabayes,mbetabayes)
postsd=sqrt(c(valphabayes,vbetabayes))

# evidence of the model
intlike=(g+1)^(-p/2)*kappa^(-(n-1)/2)

We  will  see  further  aspects  of  BayesReg  in  the  following  sections. 


3.4.3  Bayes  Factors  and  Model  Comparison 


One important inferential issue pertaining to linear models is to test whether
or not a specific explanatory variable is truly explanatory or, in other words,
to decide which explanatory variables should be kept within the model. This
leads to tests on the nullity of some elements of the parameter $\beta$. Following
the general testing methodology presented in Chap. 2, these tests can be
conducted using Bayes factors. In the case of linear models and under Zellner’s
G-priors, those Bayes factors are actually available in closed form.

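When considering the marginal likelihood (or evidence) at the core of the
Bayes factors, we have, if $p \neq 0$,

$$f(y) = \int\left(\int\!\!\int f(y\mid\alpha,\beta,\sigma^2)\,\pi(\beta\mid\alpha,\sigma^2)\,\pi(\sigma^2,\alpha)\,\mathrm{d}\alpha\,\mathrm{d}\beta\right)\mathrm{d}\sigma^2,$$

with

$$\begin{aligned}
f(y\mid\alpha,\beta,\sigma^2)\,\pi(\beta\mid\alpha,\sigma^2) ={}& \frac{|X^{\mathrm T}X|^{1/2}}{(2\pi\sigma^2)^{(n+p)/2}\,g^{p/2}}\,\exp\left\{-\frac{n}{2\sigma^2}(\alpha-\bar y)^2\right\}\\
&\times \exp\left\{-\frac{1}{2\sigma^2}(y-\bar y 1_n-X\beta)^{\mathrm T}(y-\bar y 1_n-X\beta)\right\}\\
&\times \exp\left\{-\frac{1}{2g\sigma^2}(\beta-\tilde\beta)^{\mathrm T}X^{\mathrm T}X(\beta-\tilde\beta)\right\},
\end{aligned}$$

and $\pi(\alpha,\sigma^2) = \delta\sigma^{-2}$ (where $\delta$ is an arbitrary constant). Thus

$$\begin{aligned}
f(y) &= \delta\,n^{-1/2}(g+1)^{-p/2}(2\pi)^{-(n-1)/2}\int(\sigma^2)^{-(n-1)/2-1}\exp\left\{-\frac{1}{2\sigma^2}\left[s^2+\frac{(\tilde\beta-\hat\beta)^{\mathrm T}X^{\mathrm T}X(\tilde\beta-\hat\beta)}{g+1}\right]\right\}\mathrm{d}\sigma^2\\
&= \frac{\delta\,\Gamma((n-1)/2)}{\pi^{(n-1)/2}\,n^{1/2}}\,(g+1)^{-p/2}\left[s^2+\frac{(\tilde\beta-\hat\beta)^{\mathrm T}X^{\mathrm T}X(\tilde\beta-\hat\beta)}{g+1}\right]^{-(n-1)/2}\\
&= \frac{\delta\,\Gamma((n-1)/2)}{\pi^{(n-1)/2}\,n^{1/2}}\,(g+1)^{-p/2}\,\kappa^{-(n-1)/2}.
\end{aligned}\tag{3.4}$$

If $p = 0$, a similar expression emerges:

$$f(y) = \int\!\!\int f(y\mid\alpha,\sigma^2)\,\pi(\alpha,\sigma^2)\,\mathrm{d}\alpha\,\mathrm{d}\sigma^2,$$

with

$$f(y\mid\alpha,\sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left\{-\frac{1}{2\sigma^2}(y-\bar y 1_n)^{\mathrm T}(y-\bar y 1_n)\right\}\exp\left\{-\frac{n}{2\sigma^2}(\bar y-\alpha)^2\right\}.$$

The integration in both $\alpha$ and $\sigma^2$ can then be conducted in closed form and
we obtain

$$f(y) = \frac{\delta\,\Gamma((n-1)/2)}{\pi^{(n-1)/2}\,n^{1/2}}\left[(y-\bar y 1_n)^{\mathrm T}(y-\bar y 1_n)\right]^{-(n-1)/2}$$

as the evidence associated with this “null” model. The evidence corresponds to
intlike0 in the BayesReg code.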

As pointed out in Chap. 2, the computation of Bayes factors is plagued by
the inability to include generic improper prior distributions. In order to bypass
this difficulty, we will assume that all the linear models under comparison do
include the parameter $\alpha$, which means that each regression model includes
an intercept. This assumption allows us to take the same improper prior
(and hence the same arbitrary constant $\delta$) on $(\alpha,\sigma^2)$ for all of those models.
Otherwise, the Bayes factors simply cannot be correctly defined.

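When we compare two sets of regressors, we have to handle two regression
matrices, $X^1$ and $X^2$, with respective dimensions $(n,p_1)$ and $(n,p_2)$,
extracted from the original matrix $X$ by removing some columns. From a
Bayesian perspective, using Zellner’s G-prior modelling in both cases, we are
thus comparing model $\mathfrak{M}_1$:

$$y\mid\alpha,\beta^1,\sigma^2 \sim \mathscr{N}_n\left(\alpha 1_n + X^1\beta^1,\ \sigma^2 I_n\right),$$

$$\beta^1\mid\alpha,\sigma^2 \sim \mathscr{N}_{p_1}\left(\tilde\beta^1,\ g_1\sigma^2\left((X^1)^{\mathrm T}X^1\right)^{-1}\right),\qquad p_1 \neq 0,$$

with model $\mathfrak{M}_2$:

$$y\mid\alpha,\beta^2,\sigma^2 \sim \mathscr{N}_n\left(\alpha 1_n + X^2\beta^2,\ \sigma^2 I_n\right),$$

$$\beta^2\mid\alpha,\sigma^2 \sim \mathscr{N}_{p_2}\left(\tilde\beta^2,\ g_2\sigma^2\left((X^2)^{\mathrm T}X^2\right)^{-1}\right).$$

Using the above derivations, the Bayes factor between model $\mathfrak{M}_1$ and model
$\mathfrak{M}_2$ is then given by

$$B_{12}(y) = \frac{(g_1+1)^{-p_1/2}\left[s_1^2 + (\tilde\beta^1-\hat\beta^1)^{\mathrm T}(X^1)^{\mathrm T}X^1(\tilde\beta^1-\hat\beta^1)/(g_1+1)\right]^{-(n-1)/2}}{(g_2+1)^{-p_2/2}\left[s_2^2 + (\tilde\beta^2-\hat\beta^2)^{\mathrm T}(X^2)^{\mathrm T}X^2(\tilde\beta^2-\hat\beta^2)/(g_2+1)\right]^{-(n-1)/2}}.$$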

For caterpillar, if we have to test the null hypothesis $H_0: \beta_8 = \beta_9 = 0$,
using $\tilde\beta^1 = 0_8$, $\tilde\beta^2 = 0_6$, and an arbitrary⁷ $g_1 = g_2 = 100$ in Zellner’s G-priors,
we obtain $B_{12} = 0.0165$ when model $\mathfrak{M}_2$ corresponds to $H_0$. Using Jeffreys’
scale of evidence (provided in Chap. 2), this implies that $\log_{10}(B_{12}) = -1.78$,
hence that the posterior distribution appears to strongly favor $H_0$.

More generally, using $\tilde\beta = 0_8$ and $g = 100$, we can produce a Bayesian
regression output, programmed in R, which mimics a standard software
regression output like lm: besides the estimation of the $\beta_j$’s via their posterior
expectation, we include the Bayes factors $B_{12}$, in the log scale $\log_{10}(B_{12})$,
corresponding to testing the null hypotheses $H_0: \beta_j = 0$. (The stars are
related to Jeffreys’ scale of evidence.)


The R code corresponding to this “standard” output is also part of the R
function BayesReg:

bayesfactor=rep(0,p)
p0=p-1 # remove one variate
X0=X[,-j]   # j is the index of the variate being tested

U0=solve(t(X0)%*%X0)%*%t(X0)
betatilde0=U0%*%X%*%betatilde
betaml0=U0%*%y

s20=t(y-alphaml-X0%*%betaml0)%*%(y-alphaml-X0%*%betaml0)
kappa0=as.numeric(s20+t(betatilde0-betaml0)%*%t(X0)%*%
X0%*%(betatilde0-betaml0)/(g+1))
intlike0=(g+1)^(-p0/2)*kappa0^(-(n-1)/2)
bayesfactor[j]=intlike/intlike0

where intlike is the marginal likelihood for the full model. (The way this
computation is repeated and used to mimic the output of the lm function can
be found by reading the function BayesReg.)
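To make the repeated computation concrete, here is a self-contained sketch that loops over the variates and returns the $\log_{10}$ Bayes factor of the full model against the model without $x_j$. It assumes $\tilde\beta = 0$, works on the log scale for numerical stability, and uses simulated data; `log_evidence` is a hypothetical helper and not a bayess function.

```r
# Sketch: log10 Bayes factors of the full model against the model without x_j,
# under Zellner's G-prior with betatilde = 0 (an assumption of this example).
set.seed(2)
n <- 60; p <- 4
X <- scale(matrix(rnorm(n * p), n, p))
y <- 0.5 + X %*% c(1.5, 0, 0, -1) + rnorm(n)
g <- n

# log of (g+1)^(-q/2) * kappa^(-(n-1)/2): the evidence up to the constant
# shared by all models that include an intercept
log_evidence <- function(Xsub, y, g) {
  n <- length(y)
  q <- ncol(Xsub)
  betaml <- solve(t(Xsub) %*% Xsub, t(Xsub) %*% y)
  s2 <- sum((y - mean(y) - Xsub %*% betaml)^2)
  # with betatilde = 0, kappa = s2 + betaml' X'X betaml / (g+1)
  kappa <- s2 + as.numeric(t(betaml) %*% t(Xsub) %*% Xsub %*% betaml) / (g + 1)
  -q / 2 * log(g + 1) - (n - 1) / 2 * log(kappa)
}

full <- log_evidence(X, y, g)
log10bf <- sapply(seq_len(p), function(j)
  (full - log_evidence(X[, -j, drop = FALSE], y, g)) / log(10))
print(round(log10bf, 3))
```

Positive entries of `log10bf` are evidence against $H_0: \beta_j = 0$; using `drop = FALSE` keeps the reduced design a matrix even when a single column remains.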

For the caterpillar dataset, $\tilde\beta = 0_8$ and $g = n = 33$, the G-prior estimate
of $\sigma^2$ is equal to 0.653, while the posterior means and standard deviations of
the $\beta_j$’s are given below. We can immediately spot that the (most) significant
explanatory variables are the same ones as those selected by lm, $x_1$, $x_2$, and
$x_7$. Note, however, that this output does not rigorously validate the selection
of the submodel with the covariates $x_1$, $x_2$, and $x_7$, as it does not produce the
Bayes factor associated with this (sub)model and the full model.

⁷ Arbitrary means here that this choice is no more justified than any other. We will
see later that $g_j = n$ is the recommended or default value for non-informative
settings.

> res1=BayesReg(y,X)

          PostMean PostStError Log10bf EvidAgaH0
Intercept  -0.8133      0.1407
x1         -0.5039      0.1883  0.7224      (**)
x2         -0.3755      0.1508  0.5392      (**)
x3          0.6225      0.3436 -0.0443
x4         -0.2776      0.2804 -0.5422
x5         -0.2069      0.1499 -0.3378
x6          0.2806      0.4760 -0.6857
x7         -1.0420      0.4178  0.5435      (**)
x8         -0.0221      0.1531 -0.7609

Posterior Mean of Sigma2: 0.6528
Posterior StError of Sigma2: 0.939


3.4.4  Prediction 


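The prediction of $m \geq 1$ future observations from units for which the
explanatory variables $\tilde X$ (but not the outcome variable $\tilde y$) have been observed or set
is also based on the posterior distribution. Logically enough, were $\alpha$, $\beta$ and
$\sigma^2$ known quantities, the $m$-vector $\tilde y$ would then have a Gaussian distribution
with mean $\alpha 1_m + \tilde X\beta$ and variance $\sigma^2 I_m$. The predictive distribution on $\tilde y$ is
defined as the marginal in $\tilde y$ of the joint posterior distribution on $(\tilde y,\alpha,\beta,\sigma^2)$.

Conditional on $\sigma^2$, the vector $\tilde y$ of future observations has a Gaussian
distribution and we can derive its expectation (used as our Bayesian estimator)
by averaging over $\alpha$ and $\beta$,

$$\begin{aligned}
\mathbb{E}^\pi[\tilde y\mid\sigma^2,y] &= \mathbb{E}^\pi\left[\mathbb{E}^\pi(\tilde y\mid\alpha,\beta,\sigma^2,y)\mid\sigma^2,y\right]\\
&= \mathbb{E}^\pi[\alpha 1_m + \tilde X\beta\mid\sigma^2,y]\\
&= \hat\alpha 1_m + \tilde X\,\frac{\tilde\beta + g\hat\beta}{g+1},
\end{aligned}$$

which is independent from $\sigma^2$. This representation is quite intuitive, being the
product of the matrix of explanatory variables $\tilde X$ by the Bayesian estimator
of $\beta$. Similarly, we can compute

$$\begin{aligned}
\mathbb{V}^\pi(\tilde y\mid\sigma^2,y) &= \mathbb{E}^\pi\left[\mathbb{V}^\pi(\tilde y\mid\alpha,\beta,\sigma^2,y)\mid\sigma^2,y\right] + \mathbb{V}^\pi\left(\mathbb{E}^\pi(\tilde y\mid\alpha,\beta,\sigma^2)\mid\sigma^2,y\right)\\
&= \mathbb{E}^\pi[\sigma^2 I_m\mid\sigma^2,y] + \mathbb{V}^\pi(\alpha 1_m + \tilde X\beta\mid\sigma^2,y)\\
&= \sigma^2\left(I_m + \frac{g}{g+1}\,\tilde X(X^{\mathrm T}X)^{-1}\tilde X^{\mathrm T}\right).
\end{aligned}$$

Due to this factorization, and the fact that the conditional expectation does
not depend on $\sigma^2$, we thus obtain

$$\mathbb{V}^\pi(\tilde y\mid y) = \hat\sigma^2\left(I_m + \frac{g}{g+1}\,\tilde X(X^{\mathrm T}X)^{-1}\tilde X^{\mathrm T}\right),$$

with $\hat\sigma^2 = \mathbb{E}^\pi[\sigma^2\mid y]$.

This decomposition of the variance makes perfect sense: Conditionally
on $\sigma^2$, the posterior predictive variance has two terms, the first term
being $\sigma^2 I_m$, which corresponds to the sampling variation, and the second one
being $\sigma^2\frac{g}{g+1}\tilde X(X^{\mathrm T}X)^{-1}\tilde X^{\mathrm T}$, which corresponds to the uncertainty about $\beta$.

HPD credible regions and tests can then be conducted based on this
conditional predictive distribution

$$\tilde y\mid\sigma^2,y \sim \mathscr{N}_m\left(\mathbb{E}^\pi[\tilde y\mid\sigma^2,y],\ \mathbb{V}^\pi(\tilde y\mid\sigma^2,y)\right).$$

Integrating $\sigma^2$ out to produce the marginal distribution of $\tilde y$ leads to a
multivariate Student’s $t$ distribution

$$\tilde y\mid y \sim \mathscr{T}_m\left(n,\ \hat\alpha 1_m + \frac{g\,\tilde X\hat\beta}{g+1},\ \frac{s^2+\hat\beta^{\mathrm T}X^{\mathrm T}X\hat\beta}{n}\left\{I_m+\tilde X(X^{\mathrm T}X)^{-1}\tilde X^{\mathrm T}\right\}\right)$$

(following a straightforward but lengthy derivation that is very similar to the
one we conducted at the end of Chap. 2, see (2.11)).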
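The predictive mean and the variance decomposition can be evaluated in a few lines. The sketch below follows the formulas as stated in this section, on simulated data with $\tilde\beta = 0$ and $g = n$ (both assumptions of the example), and uses the posterior mean `kappa/(n-3)` of $\sigma^2$ from the BayesReg extract as $\hat\sigma^2$.

```r
# Sketch: posterior predictive mean and variance under the G-prior, following
# the formulas of this section. Data and names are illustrative assumptions.
set.seed(3)
n <- 40; p <- 2; m <- 3
X <- scale(matrix(rnorm(n * p), n, p))
y <- 2 + X %*% c(1, -1) + rnorm(n)
g <- n; betatilde <- rep(0, p)

# new units are standardized with the centering and scaling of the training X
Xnew <- scale(matrix(rnorm(m * p), m, p),
              center = attr(X, "scaled:center"),
              scale  = attr(X, "scaled:scale"))

XtXinv   <- solve(t(X) %*% X)
betahat  <- XtXinv %*% t(X) %*% y
alphahat <- mean(y)

pred_mean <- alphahat + Xnew %*% (betatilde + g * betahat) / (g + 1)
# factor multiplying sigma^2: I_m + g/(g+1) Xnew (X'X)^{-1} Xnew'
pred_fac <- diag(m) + g / (g + 1) * Xnew %*% XtXinv %*% t(Xnew)

s2 <- sum((y - alphahat - X %*% betahat)^2)
kappa <- as.numeric(s2 + t(betatilde - betahat) %*% t(X) %*% X %*%
                      (betatilde - betahat) / (g + 1))
sigma2hat <- kappa / (n - 3)     # posterior mean of sigma^2 (msigma2bayes)
pred_var  <- sigma2hat * pred_fac
print(cbind(mean = as.numeric(pred_mean), sd = sqrt(diag(pred_var))))
```

The diagonal of `pred_fac` is always at least one: the identity part is the sampling variation, and the remainder grows with the distance of the new unit from the center of the design.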


3.5  Markov  Chain  Monte  Carlo  Methods 

Given the complexity of most models encountered in Bayesian modeling,
standard simulation methods are not a sufficiently versatile solution. We now
present the rudiments of a technique that emerged in the late 1980s as the
core of Bayesian computing and that has since then revolutionized the field.

This technique is based on Markov chains, but we will not make many
incursions into the theory of Markov chains (see Meyn and Tweedie, 1993),
focusing rather on the practical implementation of these algorithms and
trusting that the underlying theory is sound enough to validate them (Robert and
Casella, 2004). At this point, it is sufficient to recall that a Markov chain
$(x_t)_{t\in\mathbb{N}}$ is a sequence of dependent random vectors whose dependence on the
past values $x_0,\ldots,x_{t-1}$ stops at the value immediately before, $x_{t-1}$, and that
is entirely defined by its kernel, that is, the conditional distribution of $x_t$
given $x_{t-1}$.

The central idea behind these new methods, called Markov chain Monte
Carlo (MCMC) algorithms, is that, to simulate from a distribution $\pi$ (for
instance, the posterior distribution), it is actually sufficient to produce a Markov
chain $(x_t)_{t\in\mathbb{N}}$ whose stationary distribution is $\pi$: when $x_t$ is marginally
distributed according to $\pi$, then $x_{t+1}$ is also marginally distributed according to
$\pi$. If an algorithm that generates such a chain can be constructed, the ergodic
theorem guarantees that, in almost all settings, the average

$$\frac{1}{T}\sum_{t=1}^{T} g(x_t)$$

converges to $\mathbb{E}^{\pi}[g(x)]$, no matter what the starting value.⁸

More informally, this property means that, for large enough $t$, $x_t$ is
approximately distributed from $\pi$ and can thus be used like the output from a
more standard simulation algorithm (even though one must take care of the
correlation between the $x_t$’s created by the Markovian structure). For integral
approximation purposes, the difference from regular Monte Carlo
approximations is that the variance structure of the estimator is more complex because of
the Markovian dependence. These methods being central to the cases studied
from this stage onward, we hope that the reader will become sufficiently
proficient with them by the end of the book! In this chapter, we detail a particular
type of MCMC algorithm, the Gibbs sampler, that is currently sufficient for
our needs. The next chapter will introduce a more universal type of algorithm.
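A toy chain helps to see the ergodic theorem at work before any Gibbs sampler is introduced. The sketch below uses a Gaussian autoregressive kernel, which is not taken from this chapter and is chosen only because its stationary distribution is known to be $\mathscr{N}(0,1)$; the starting value is deliberately poor.

```r
# Sketch: ergodic averages along a Markov chain with known stationary law.
# The AR(1) kernel x_t | x_{t-1} ~ N(rho * x_{t-1}, 1 - rho^2) leaves N(0,1)
# invariant; it is an illustrative assumption, not an algorithm from the text.
set.seed(4)
niter <- 1e5; rho <- 0.9
x <- numeric(niter)
x[1] <- 10                       # deliberately poor starting value
for (t in 2:niter)
  x[t] <- rho * x[t - 1] + sqrt(1 - rho^2) * rnorm(1)

gfun <- function(x) x^2          # E[g(x)] = 1 under N(0,1)
ergodic_avg <- cumsum(gfun(x)) / seq_len(niter)
print(ergodic_avg[c(100, 1000, 1e4, 1e5)])
```

The early averages are dominated by the starting value, while the later ones settle near 1, more slowly than an iid sample would because of the positive correlation between successive $x_t$’s.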


3.5.1  Conditionals 


A first remark that motivates the use of the Gibbs sampler⁹ is that, within
structures such as

$$\pi(x_1) = \int \pi_1(x_1\mid x_2)\,\tilde\pi(x_2)\,\mathrm{d}x_2, \tag{3.5}$$

to simulate from the joint distribution

$$\pi(x_1,x_2) = \pi_1(x_1\mid x_2)\,\tilde\pi(x_2) \tag{3.6}$$

automatically produces (marginal) simulation from $\pi(x_1)$. Therefore, in
settings where (3.5) holds, it is not necessary to simulate from $\pi(x_1)$ when one
can jointly simulate $(x_1,x_2)$ from (3.6).

For example, consider $(x_1,x_2)\in\mathbb{N}\times[0,1]$ distributed from the joint density

$$\pi(x_1,x_2) \propto \binom{n}{x_1}\,x_2^{x_1+\alpha-1}(1-x_2)^{n-x_1+\beta-1}.$$

This is a joint distribution where

$$x_1\mid x_2 \sim \mathscr{B}(n,x_2) \quad\text{and}\quad x_2\mid\alpha,\beta \sim \mathscr{B}e(\alpha,\beta).$$


⁸ In probabilistic terms, if the Markov chains produced by these algorithms are
irreducible, then these chains are both positive recurrent with stationary distribution
$\pi$ and ergodic, that is, asymptotically independent of the starting value $x_0$.

⁹ In the literature, both the denominations Gibbs sampler and Gibbs sampling
can be found. In this book, we will use Gibbs sampling for the simulation technique
and Gibbs sampler for the simulation algorithm.

Therefore, although

$$\pi(x_1) = \binom{n}{x_1}\frac{B(\alpha+x_1,\beta+n-x_1)}{B(\alpha,\beta)}$$

is available in closed form as the beta-binomial distribution, it is unnecessary
to work with this marginal when one can simulate an iid sample $(x_1^{(t)},x_2^{(t)})$
($t = 1,\ldots,N$) as

$$x_2^{(t)} \sim \mathscr{B}e(\alpha,\beta) \quad\text{and}\quad x_1^{(t)} \sim \mathscr{B}(n,x_2^{(t)}).$$

Integrals such as $\mathbb{E}[x_1/(x_1+1)]$ can then be approximated by

$$\frac{1}{N}\sum_{t=1}^{N}\frac{x_1^{(t)}}{x_1^{(t)}+1},$$

using a regular Monte Carlo approach.
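The approximation can be checked against the closed-form marginal. In the sketch below the values of $n$, $\alpha$ and $\beta$ are arbitrary choices made for illustration; the exact expectation is obtained by summing over the beta-binomial probabilities.

```r
# Sketch: approximate E[x1/(x1+1)] by joint simulation and compare with the
# exact value from the beta-binomial marginal. Parameter values are assumptions.
set.seed(5)
n <- 15; a <- 3; b <- 7; N <- 1e5
x2 <- rbeta(N, a, b)                  # x2 ~ Be(alpha, beta)
x1 <- rbinom(N, n, x2)                # x1 | x2 ~ Bin(n, x2)
mc_est <- mean(x1 / (x1 + 1))

k <- 0:n                              # exact beta-binomial probabilities
pmf <- choose(n, k) * beta(a + k, b + n - k) / beta(a, b)
exact <- sum(k / (k + 1) * pmf)
print(c(monte_carlo = mc_est, exact = exact))
```

The simulated $x_2$’s are discarded once the $x_1$’s are produced, which is exactly the marginalization described by (3.5).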

Unfortunately, even when one works with a representation such as (3.6)
that is naturally associated with the original model, it is often the case that
the mixing density $\tilde\pi(x_2)$ itself is neither available in closed form nor amenable
to simulation. However, both conditional posterior distributions,

$$\pi_1(x_1\mid x_2) \quad\text{and}\quad \pi_2(x_2\mid x_1),$$

can often be simulated, and the following method takes full advantage of this
feature.


3.5.2  Two-Stage  Gibbs  Sampler 

The availability of both conditionals of (3.6) in terms of simulation can be
exploited to build a transition kernel and a corresponding Markov chain,
somewhat analogous to the derivation of the maximum of a multivariate function
via an iterative device that successively maximizes the function in each of its
arguments until a fixed point is reached.

The corresponding Markov kernel is built by simulating successively from
each conditional distribution, with the conditioning variable being updated
on the run. It is called the two-stage Gibbs sampler or sometimes the data
augmentation algorithm, although both terms are rather misleading.¹⁰


¹⁰ Gibbs sampling got its name from Gibbs fields, used in image analysis, when
Geman and Geman (1984) proposed an early version of this algorithm, while data
augmentation refers to Tanner’s (1996) special use of this algorithm in missing-data
settings, as seen in Chap. 6.

Algorithm 3.3 Two-Stage Gibbs Sampler

Initialization: Start with an arbitrary value $x_2^{(0)}$.
Iteration $t$: Given $x_2^{(t-1)}$, generate
1. $x_1^{(t)}$ according to $\pi_1(x_1\mid x_2^{(t-1)})$,
2. $x_2^{(t)}$ according to $\pi_2(x_2\mid x_1^{(t)})$.

Note that, in the second step of the algorithm, $x_2^{(t)}$ is generated conditional
on $x_1 = x_1^{(t)}$, not $x_1^{(t-1)}$. The validation of this algorithm is that, for both
generations, $\pi$ is a stationary distribution. Therefore, the limiting distribution
of the chain $(x_1^{(t)},x_2^{(t)})_t$ is $\pi$ if the chain is irreducible; that is, if it can reach
any region in the support of $\pi$ in a finite number of steps. (Note that there is
a difference between the stationary distribution and the limiting distribution
only in cases when the chain is not ergodic, as shown in Exercise 3.9.)
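Algorithm 3.3 can be tried on the beta-binomial pair of Sect. 3.5.1, for which the second conditional follows from the joint density as $x_2\mid x_1 \sim \mathscr{B}e(\alpha+x_1,\beta+n-x_1)$. The sketch below is only an illustration, since direct simulation is available in that case; the parameter values and starting point are assumptions.

```r
# Sketch of Algorithm 3.3 on the beta-binomial pair, where
# x1 | x2 ~ Bin(n, x2) and x2 | x1 ~ Be(alpha + x1, beta + n - x1).
set.seed(6)
n <- 15; a <- 3; b <- 7; niter <- 5e4
x1 <- numeric(niter); x2 <- numeric(niter)
x2cur <- 0.5                                    # arbitrary value x2^(0)
for (t in seq_len(niter)) {
  x1[t] <- rbinom(1, n, x2cur)                  # step 1: x1^(t) | x2^(t-1)
  x2[t] <- rbeta(1, a + x1[t], b + n - x1[t])   # step 2: x2^(t) | x1^(t)
  x2cur <- x2[t]
}
# marginal checks: E[x2] = a/(a+b) and E[x1] = n a/(a+b)
print(c(mean_x2 = mean(x2), target_x2 = a / (a + b),
        mean_x1 = mean(x1), target_x1 = n * a / (a + b)))
```

Note that step 2 uses the freshly simulated `x1[t]`, in line with the remark following the algorithm.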

The practical implementation of Gibbs sampling involves solving two types
of difficulties: the first type corresponds to deriving an efficient decomposition
of the joint distribution in easily-simulated conditionals and the second one to
deciding when to stop the algorithm. Evaluating the efficiency of the
decomposition includes assessing the ease of simulating from both conditionals and
the level of correlation between the $x^{(t)}$’s, as well as the mixing behavior of
the chain, that is, its ability to explore the support of $\pi$ sufficiently fast. While
deciding whether or not a given conditional can be simulated is easy enough,
it is not always possible to find a manageable conditional, and more robust
alternatives such as the Metropolis-Hastings algorithm will be described in
the following chapters (see Sect. 4.2).

Choosing a stopping rule also relates to the mixing performances of the
algorithm, as well as to its ability to approximate posterior expectations
under $\pi$. Many indicators have been proposed in the literature (see Robert and
Casella, 2004, Chap. 12) to signify convergence, or lack thereof, although none
of these is foolproof. In the easiest cases, the lack of convergence is blatant and
can be spotted on the raw plot of the sequence of the $x^{(t)}$’s, while, in other
cases, the Gibbs sampler explores very satisfactorily one mode of the posterior
distribution but fails altogether to visit the other modes of the posterior: we
will encounter such cases in Chap. 6 with mixtures of distributions.
Throughout this chapter and the following ones, we give hints on how to implement
these recommendations in practice.

Consider the posterior distribution derived in Exercise 2.11, for $n = 2$
observations,

$$\pi(\mu\mid\mathscr{D}) \propto \frac{e^{-\mu^2/20}}{\{1+(x_1-\mu)^2\}\{1+(x_2-\mu)^2\}}.$$

Even though this is a univariate distribution, it can still be processed by a
Gibbs sampler through a data augmentation step, thus illustrating the idea
behind (3.5). In fact, since ($j = 1,2$)

$$\frac{1}{1+(x_j-\mu)^2} = \int_0^\infty e^{-\omega_j[1+(x_j-\mu)^2]}\,\mathrm{d}\omega_j,$$

we can define $\omega = (\omega_1,\omega_2)$ and envision $\pi(\mu\mid\mathscr{D})$ as the marginal
distribution of

$$\pi(\mu,\omega\mid\mathscr{D}) \propto e^{-\mu^2/20}\times\prod_{j=1}^{2} e^{-\omega_j[1+(x_j-\mu)^2]}.$$

For this multivariate distribution, a corresponding Gibbs sampler is associated
with the following two steps:

1. Generate $\mu^{(t)} \sim \pi(\mu\mid\omega^{(t-1)},\mathscr{D})$.

2. Generate $\omega^{(t)} \sim \pi(\omega\mid\mu^{(t)},\mathscr{D})$.

The second step is straightforward: the $\omega_i$’s are conditionally independent
and distributed as $\mathscr{E}xp(1+(x_i-\mu)^2)$. The first step is also well-defined
since $\pi(\mu\mid\omega,\mathscr{D})$ is a normal distribution with mean $\sum_i\omega_i x_i/(\sum_i\omega_i+1/20)$
and variance $1/(2\sum_i\omega_i+1/10)$. The corresponding R program then simplifies
into two lines

> mu = rnorm(1,sum(x*omega)/(sum(omega)+.05),
+ sqrt(1/(.1+2*sum(omega))))
> omega = rexp(2,1+(x-mu)^2)

and the output of the simulation is represented in Fig. 3.4, with a very
satisfying fit between the histogram of the simulated values and the target. A
detailed zoom on the last 100 iterations shows how the chain $(\mu^{(t)})$ moves
around, alternatively visiting each mode of the target.

Fig. 3.4. (Top) Last 100 iterations of the chain $(\mu^{(t)})$; (bottom) histogram of the
chain $(\mu^{(t)})$ and comparison with the target density for 10,000 iterations
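For completeness, a full run of this sampler only requires wrapping the two updates in a loop. In the sketch below the two observations in `obs` are placeholder values assumed for illustration (the data of Exercise 2.11 are not reproduced in this section), and the output is checked against a numerical integration of the target rather than against a figure.

```r
# Sketch: the two Gibbs updates wrapped in a loop. The observations in `obs`
# are hypothetical placeholders, assumed for illustration only.
set.seed(7)
obs <- c(-4.3, 3.2)
niter <- 1e5
mu <- numeric(niter)
omega <- rexp(2)                      # arbitrary starting value for omega
for (t in seq_len(niter)) {
  mu[t] <- rnorm(1, sum(obs * omega) / (sum(omega) + 0.05),
                 sqrt(1 / (0.1 + 2 * sum(omega))))
  omega <- rexp(2, 1 + (obs - mu[t])^2)
}
# unnormalized target, normalized numerically for comparison
target <- function(m) exp(-m^2 / 20) /
  ((1 + (obs[1] - m)^2) * (1 + (obs[2] - m)^2))
const <- integrate(target, -30, 30)$value
exact_mean <- integrate(function(m) m * target(m), -30, 30)$value / const
print(c(gibbs = mean(mu), quadrature = exact_mean))
```

A histogram of `mu` overlaid with `target(m) / const` would give a picture of the same kind as the bottom panel of Fig. 3.4.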

When running a Gibbs sampler, the number of iterations should never be fixed
in  advance:  it  is  usually  impossible  to  predict  the  performance  of  a  given  sampler 
before  producing  a  corresponding  chain.  Deciding  on  the  length  of  an  MCMC 
run  is  therefore  a  sequential  process  where  output  behaviors  are  examined  after 
pilot  runs  and  new  simulations  (or  new  samplers)  are  chosen  on  the  basis  of 
these  pilot  runs. 


3.5.3  The  General  Gibbs  Sampler 


For a joint distribution $\pi(x_1,\ldots,x_p)$ with full conditionals $\pi_1,\ldots,\pi_p$ where $\pi_j$
is the distribution of $x_j$ conditional on $(x_1,\ldots,x_{j-1},x_{j+1},\ldots,x_p)$, the Gibbs
sampler simulates successively from all conditionals, modifying one
component of $x$ at a time. The corresponding algorithmic representation is given in
Algorithm 3.4.


Algorithm 3.4 Gibbs Sampler

Initialization: Start with an arbitrary value $x^{(0)} = (x_1^{(0)},\ldots,x_p^{(0)})$.

Iteration $t$: Given $(x_1^{(t-1)},\ldots,x_p^{(t-1)})$, generate

1. $x_1^{(t)}$ according to $\pi_1(x_1\mid x_2^{(t-1)},\ldots,x_p^{(t-1)})$,

2. $x_2^{(t)}$ according to $\pi_2(x_2\mid x_1^{(t)},x_3^{(t-1)},\ldots,x_p^{(t-1)})$,

...

p. $x_p^{(t)}$ according to $\pi_p(x_p\mid x_1^{(t)},\ldots,x_{p-1}^{(t)})$.


Quite logically, the validation of this generalization of Algorithm 3.3 is
identical: for each of the $p$ steps of the $t$-th iteration, the joint distribution
$\pi(x)$ is stationary. Under the same restriction on the irreducibility of the
chain, it also converges to $\pi$ for every possible starting value. Note that the
order in which the components of $x$ are simulated can be modified at each
iteration, either deterministically or randomly, without putting the validity of
the algorithm in jeopardy.

The two-stage Gibbs sampler naturally appears as a special case of
Algorithm 3.4 for $p = 2$. It is, however, endowed with stronger theoretical properties,
as detailed in Robert and Casella (2004, Chap. 9) and Robert and Casella
(2009, Chap. 7).
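The following sketch runs Algorithm 3.4 with a randomly permuted update order on a target whose full conditionals are explicit. The equicorrelated trivariate normal target is an assumption chosen for the example, not a model from this chapter: with unit variances and common correlation $\rho$, each full conditional is normal with mean $\rho\sum_{k\neq j}x_k/\{1+(p-2)\rho\}$ and variance $1-(p-1)\rho^2/\{1+(p-2)\rho\}$.

```r
# Sketch of Algorithm 3.4 for an equicorrelated trivariate normal target,
# with the update order reshuffled at each iteration. The target and the
# variable names are illustrative assumptions.
set.seed(8)
p <- 3; rho <- 0.5; niter <- 2e4
cond_coef <- rho / (1 + (p - 2) * rho)              # weight on sum of others
cond_var  <- 1 - (p - 1) * rho^2 / (1 + (p - 2) * rho)

chain <- matrix(0, niter, p)
xcur <- rep(5, p)                       # arbitrary starting value x^(0)
for (t in seq_len(niter)) {
  for (j in sample(p)) {                # random order of the p updates
    xcur[j] <- rnorm(1, cond_coef * sum(xcur[-j]), sqrt(cond_var))
  }
  chain[t, ] <- xcur
}
print(round(cor(chain), 2))             # off-diagonal terms near rho
```

Each update conditions on the current values of all other components, whichever of them have already been refreshed during the iteration.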

To  conclude  this  section,  let  us  stress  that  the  impact  of  MCMC  on 
Bayesian  statistics  has  been  considerable.  Since  the  1990s,  which  saw  the 
emergence  of  MCMC  methods  in  the  statistical  community,  the  occurrence  of 
Bayesian  methods  in  applied  statistics  has  greatly  increased,  and  the  frontier 
between  Bayesian  and  “classical”  statistics  is  now  so  fuzzy  that  in  some  fields, 
it  has  completely  disappeared.  From  a  Bayesian  point  of  view,  the  access  to 
far  more  advanced  computational  means  has  induced  a  radical  modification 
of  the  way  people  work  with  models  and  prior  assumptions.  In  particular, 
it  has  opened  the  way  to  process  much  more  complex  structures,  such  as 
graphical  models  and  latent  variable  models  (see  Chap.  6).  It  has  also  freed 
inference  by  opening  for  good  the  possibility  of  Bayesian  model  choice  (see, 
e.g.,  Robert,  2007,  Chap.  7).  This  expansion  is  much  more  visible  among 
academics  than  among  applied  statisticians,  though,  given  that  the  use  of 
the  MCMC  technology  requires  some  “hard”  thinking  to  process  every  new 
problem.  The  availability  of  specific  software  such  as  BUGS  has  nonetheless 
given  access  to  MCMC  techniques  to  a  wider  community,  starting  with  the 
medical  field.  New  modules  in  R  and  other  languages  like  Python  are  also 
helping  to  bridge  the  gap. 


3.6  Variable  Selection 

3.6.1  Deciding  on  Explanatory  Variables 

In an ideal world, when building a regression model, we should include all
relevant pieces of information, which in the regression context means including all
predictor variables that might possibly help in explaining $y$. However, there
are obvious drawbacks to the advice of increasing the number of explanatory
variables. For one thing, in noninformative settings, this eventually clashes
with the constraint $p < n$. For another, using a huge number of
explanatory variables leaves little information available to obtain precise estimators.
In other words, when we increase the explanatory scope of the regression
model, we do not necessarily increase its explanatory power because it gets
harder and harder to estimate the coefficients.¹¹ It is thus important to be
able to decide which variables, within a large pool of potential explanatory
variables, should be kept in a model that balances good explanatory power
with good estimation performance.

¹¹ This phenomenon is related to the principle of parsimony, also called Occam’s
razor, which states that, among two models with similar explanatory powers, the
simplest one should always be preferred. It is also connected with the learning curve
effect found in information theory and neural networks, where the performance of
a model increases on the learning dataset but decreases on a testing dataset as its
complexity increases.

This is truly a decision problem in that all potential models have to be
considered in parallel against a criterion that ranks them. This variable-selection
problem can be formalized as follows. We consider a dependent random
variable $y$ and a set of $p$ potential explanatory variables. At this stage, we assume
that every subset of $q$ explanatory variables could make a proper set of
explanatory variables for the regression of $y$. The only restriction we impose is
that the intercept (that is, the constant variable) is included in every model.
There are thus $2^p$ models in competition and we are looking for a procedure
that selects the “best” model, that is, the “most relevant” explanatory
variables. Note that this variable-selection procedure can alternatively be seen
as a two-stage estimation setting where we first estimate the indicator of the
model (within the collection of models), which also amounts to estimating
variable indicators, as detailed below, and we then estimate the parameters
corresponding to this very model.

Each of the $2^p$ models under comparison is in fact associated with a binary
indicator vector $\gamma\in\Gamma = \{0,1\}^p$, where $\gamma_j = 1$ means that the variable $x_j$
is included in the model, denoted by $\mathfrak{M}_\gamma$. This notation is quite handy since
$\gamma = (1,0,1,0,0,\ldots,1,0)$ clearly indicates which explanatory variables are in and
which are not. We also use the notation

$$q_\gamma = 1_p^{\mathrm T}\gamma = \sum_{j=1}^{p}\gamma_j$$

for computing the number of variables included in the model $\mathfrak{M}_\gamma$. We
define $\beta^\gamma$ as a sub-vector of $\beta$ containing only the components such that $x_j$ is
included in the model $\mathfrak{M}_\gamma$ and $X_\gamma$ as the sub-matrix of $X$ where only the
columns such that $x_j$ is included in the model $\mathfrak{M}_\gamma$ have been left. The model
$\mathfrak{M}_\gamma$ is thus defined as

$$y\mid\alpha,\beta^\gamma,\sigma^2,\gamma \sim \mathscr{N}_n\left(\alpha 1_n + X_\gamma\beta^\gamma,\ \sigma^2 I_n\right).$$
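The indicator notation maps directly onto R. The sketch below enumerates the $2^p$ vectors $\gamma$ through the binary expansion of an integer and extracts the corresponding sub-matrix $X_\gamma$; the helper name `index_to_gamma` and the toy design matrix are assumptions made for the example.

```r
# Sketch: enumerate the 2^p indicator vectors gamma and extract X_gamma.
# Helper and variable names are illustrative assumptions.
p <- 4
index_to_gamma <- function(i, p) as.integer(intToBits(i - 1)[1:p] == 1)

gammas <- t(sapply(seq_len(2^p), index_to_gamma, p = p))  # one row per model
q_all  <- rowSums(gammas)             # q_gamma for each of the 2^p models

set.seed(9)
X <- matrix(rnorm(10 * p), 10, p, dimnames = list(NULL, paste0("x", 1:p)))
gam  <- index_to_gamma(6, p)          # i - 1 = 5, binary 101, least bit first
Xgam <- X[, gam == 1, drop = FALSE]   # keeps columns x1 and x3
print(gam)
print(colnames(Xgam))
```

The first row of `gammas` is the null model (intercept only) and the last one is the full model.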


Once again, and apparently in contradiction to our basic tenet that different
models should enjoy completely different parameters, we are compelled to
denote by $\sigma^2$ and $\alpha$ the variance and intercept terms common to all models,
respectively. Although this is more of a mathematical trick than a true
modeling reason, the prior independence of $(\alpha,\sigma^2)$ and $\gamma$ allows for the simultaneous
use of Bayes factors and an improper prior. Despite the possibly confusing
notation, $\beta^\gamma$ and $\beta$ are completely unrelated in that they are parameters of different
models.

3.6.2  G-Prior  Distributions  for  Model  Choice


Because so many models are in competition and thus considered in the global
model all at once, we cannot expect a practitioner to specify one’s own prior
on every model $\mathfrak{M}_\gamma$ in a completely subjective and autonomous manner. We
thus now proceed to derive all priors from a single global prior associated with
the so-called full model that corresponds to $\gamma = (1,\ldots,1)$. The argument goes
as follows:

(1) For the full model, we use Zellner’s G-prior as defined in Sect. 3.4,

$$\beta\mid\sigma^2 \sim \mathscr{N}_p\left(\tilde\beta,\ g\sigma^2(X^{\mathrm T}X)^{-1}\right) \quad\text{and}\quad \pi(\alpha,\sigma^2)\propto\sigma^{-2}.$$

(2) For each (sub-)model $\mathfrak{M}_\gamma$, the prior distribution of $\beta^\gamma$ conditional on $\sigma^2$
is fixed as

$$\beta^\gamma\mid\sigma^2,\gamma \sim \mathscr{N}_{q_\gamma}\left(\tilde\beta^\gamma,\ g\sigma^2\left(X_\gamma^{\mathrm T}X_\gamma\right)^{-1}\right),$$

where $\tilde\beta^\gamma = \left(X_\gamma^{\mathrm T}X_\gamma\right)^{-1}X_\gamma^{\mathrm T}X\tilde\beta$ and we use the same prior on $(\alpha,\sigma^2)$.

This distribution is conditional on $\gamma$; in particular, this implies that, while the
variance notation $\sigma^2$ is common to all models, its distribution varies with $\gamma$.


Although there are many possible ways of defining the prior on the model
index¹² $\gamma$, we opt for the uniform prior

$$\pi(\gamma) = 2^{-p}.$$

The posterior distribution of $\gamma$ (that is, the distribution of $\gamma$ given $y$) is central
to the variable-selection methodology since it is proportional to the marginal
density of $y$ in $\mathfrak{M}_\gamma$. In addition, for prediction purposes, the prediction
distribution can be obtained by averaging over all models, the weights being the
model probabilities (this is called model averaging).

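The posterior distribution of $\gamma$ is

$$\pi(\gamma\mid y) \propto f(y\mid\gamma)\,\pi(\gamma) \propto f(y\mid\gamma) \propto (g+1)^{-(q_\gamma+1)/2}\left[y^{\mathrm T}y - \frac{g}{g+1}\,y^{\mathrm T}X_\gamma\left(X_\gamma^{\mathrm T}X_\gamma\right)^{-1}X_\gamma^{\mathrm T}y - \frac{1}{g+1}\,(\tilde\beta^\gamma)^{\mathrm T}X_\gamma^{\mathrm T}X_\gamma\tilde\beta^\gamma\right]^{-(n-1)/2}. \tag{3.7}$$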

When the number of explanatory variables is less than 15, say, the exact
derivation of the posterior probabilities for all submodels can be undertaken.
Indeed, $2^{15} = 32768$ means that the problem remains tractable. The following
R code (part of the function ModChoBayesReg) is used to calculate those
posterior probabilities and returns the top most probable models. The integrated
likelihood for the null model is computed as intlike0.

¹² For instance, one could instead use a uniform prior on the number $q_\gamma$ of
explanatory variables or a more parsimonious prior such as $\pi(\gamma) = 1/q_\gamma$.

intlike=rep(intlike0,2^p)
for (i in 2:2^p){
gam=as.integer(intToBits(i-1)[1:p]==1)
pgam=sum(gam)
Xgam=X[,which(gam==1)]
Ugam=solve(t(Xgam)%*%Xgam)%*%t(Xgam)
betatildegam=b1=Ugam%*%X%*%betatilde
betamlgam=b2=Ugam%*%y
s2gam=t(y-alphaml-Xgam%*%b2)%*%(y-alphaml-Xgam%*%b2)
kappagam=as.numeric(s2gam+t(b1-b2)%*%t(Xgam)%*%
Xgam%*%(b1-b2)/(g+1))
intlike[i]=(g+1)^(-pgam/2)*kappagam^(-(n-1)/2)
}
intlike=intlike/sum(intlike)
modcho=order(intlike)[2^p:(2^p-9)]
probtop10=intlike[modcho]

The above R code uses the generic function intToBits to turn an integer i
into the indicator vector gam. The remainder of the code is quite similar to
the model choice code when computing the Bayes factors.
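A self-contained variant of this enumeration, run on simulated data, may help in experimenting with the procedure. The sketch below assumes $\tilde\beta = 0$ and the uniform prior on $\gamma$, uses the form $(g+1)^{-q_\gamma/2}\kappa_\gamma^{-(n-1)/2}$ computed in the code above, and works on the log scale so that the normalization does not underflow for larger $n$. The data and the helper `log_marg` are hypothetical and not part of bayess.

```r
# Sketch: exact posterior model probabilities on simulated data, on the log
# scale. Assumes betatilde = 0 and a uniform prior on gamma; names are
# illustrative assumptions.
set.seed(10)
n <- 80; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))
y <- 1 + X %*% c(1, 0, -1, 0, 0) + rnorm(n)
g <- n

log_marg <- function(gam) {
  q <- sum(gam)
  yc <- y - mean(y)
  if (q == 0) return(-(n - 1) / 2 * log(sum(yc^2)))    # null model
  Xg <- X[, gam == 1, drop = FALSE]
  b <- solve(t(Xg) %*% Xg, t(Xg) %*% y)
  s2 <- sum((yc - Xg %*% b)^2)
  kappa <- s2 + as.numeric(t(b) %*% t(Xg) %*% Xg %*% b) / (g + 1)
  -q / 2 * log(g + 1) - (n - 1) / 2 * log(kappa)
}

gammas <- t(sapply(seq_len(2^p), function(i)
  as.integer(intToBits(i - 1)[1:p] == 1)))
lm_all <- apply(gammas, 1, log_marg)
post <- exp(lm_all - max(lm_all))      # subtract the max before exponentiating
post <- post / sum(post)

top <- order(post, decreasing = TRUE)[1:5]
top_models <- data.frame(
  model = apply(gammas[top, ], 1, function(gm) paste(which(gm == 1), collapse = " ")),
  prob  = round(post[top], 4))
print(top_models)
incl <- colSums(gammas * post)         # posterior inclusion probabilities
print(round(incl, 3))
```

The vector `incl` sums the posterior probabilities of all models containing each variable, which is the quantity model averaging relies on.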


For the caterpillar data, we set $\tilde\beta = 0_8$ and $g = 1$. The models
corresponding to the top 10 posterior probabilities are then given by

> ModChoBayesReg(y,X,g=1)

Number of variables less than 15
Model posterior probabilities are calculated exactly

   Top10Models  PostProb
1  1 2 3 7      0.0142
2  1 2 3 5 7    0.0138
3  1 2 7        0.0117
4  1 2 3 4 7    0.0112
5  1 2 3 4 5 7  0.0110
6  1 2 5 7      0.0108
7  1 2 3 7 8    0.0104
8  1 2 3 6 7    0.0102
9  1 2 3 5 6 7  0.0100
10 1 2 3 5 7 8  0.0098

In a basic 0-1 decision setup, we would choose the model with the
highest posterior probability, that is, the model with explanatory variables
$x_1$, $x_2$, $x_3$ and $x_7$, which corresponds to the variables

    altitude,
    slope,
    the number of pine trees in the area, and
    the number of vegetation strata.

The model selected by the procedure thus fails to correspond to the three
variables identified in the R output at the end of Sect. 3.4. But interestingly,
even under this strong shrinkage prior $g = 1$ (where the prior has the same
weight as the data), all top ten models contain the explanatory variables $x_1$,
$x_2$ and $x_7$, which have the most stars in this R analysis.


Now, the default or noninformative calibration of the G-prior corresponds
to the choice $\tilde\beta = 0_p$ and $g = n$, which reduces the prior input to the
equivalent of a single observation. Pushing $g$ to a larger value results in a
paradoxical behaviour of the procedure, which then usually picks the simpler
model: this is another illustration of the Jeffreys-Lindley paradox, mentioned
in Chap. 2.


For $\tilde\beta = 0_p$ and $g = n$, the ten most likely models and their posterior
probabilities are:

> ModChoBayesReg(y,X)

Number of variables less than 15
Model posterior probabilities are calculated exactly

   Top10Models  PostProb
1  1 2 7        0.0767
2  1 7          0.0689
3  1 2 3 7      0.0686
4  1 3 7        0.0376
5  1 2 6        0.0369
6  1 2 3 5 7    0.0326
7  1 2 5 7      0.0294
8  1 6          0.0205
9  1 2 4 7      0.0201
10 7            0.0198

For this different prior modelling, we select the same model as the classical
lm procedure, which was not the case when $g = 1$. The posterior probabilities
of the most likely models are also much lower for $g = 1$ (0.0142 for the top
model, against 0.0767 here). The top model is therefore more strongly supported
under the current, less informative prior than in the informative case. Once
again, we stress that the choice $g = 1$ is rather arbitrary and that it is used
here merely for illustrative purposes. The default value we recommend is $g = n$.

3.6.3 A Stochastic Search for the Most Likely Model


When the number $p$ of variables is large, it becomes impossible to compute
the posterior probabilities for the whole series of $2^p$ models. We then need a
tailored algorithm that samples from $\pi(\gamma \mid y)$ and thus selects the most likely
models, without first computing all the values of $\pi(\gamma \mid y)$. This can be done
rather naturally by Gibbs sampling, given the availability of the full
conditional posterior probabilities of the $\gamma_j$'s.


Indeed, if $\gamma_{-j}$ ($1 \le j \le p$) is the vector
$(\gamma_1, \ldots, \gamma_{j-1}, \gamma_{j+1}, \ldots, \gamma_p)$, the full conditional distribution
$\pi(\gamma_j \mid y, \gamma_{-j})$ of $\gamma_j$ is proportional to $\pi(\gamma \mid y)$ and
can be computed in both $\gamma_j = 0$ and $\gamma_j = 1$ at no cost (since these are the
only possible values of $\gamma_j$).


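Algorithm 3.5 Gibbs Sampler for Variable Selection


Initialization: Draw $\gamma^{(0)}$ from the uniform distribution on $\Gamma$.
Iteration $t$: Given $(\gamma_1^{(t-1)}, \ldots, \gamma_p^{(t-1)})$, generate

1. $\gamma_1^{(t)}$ according to $\pi(\gamma_1 \mid y, \gamma_2^{(t-1)}, \ldots, \gamma_p^{(t-1)})$,

2. $\gamma_2^{(t)}$ according to $\pi(\gamma_2 \mid y, \gamma_1^{(t)}, \gamma_3^{(t-1)}, \ldots, \gamma_p^{(t-1)})$,

   ...

p. $\gamma_p^{(t)}$ according to $\pi(\gamma_p \mid y, \gamma_1^{(t)}, \ldots, \gamma_{p-1}^{(t)})$.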
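
The systematic scan of Algorithm 3.5 can be sketched as follows. This is only an illustration under stated assumptions: the data are simulated, the covariates are centred, $\tilde\beta = 0$ and $g = n$, and the function names logpost and gibbs_sweep are ours. The unnormalised log-posterior follows the integrated likelihood used in the code of Sect. 3.6.2; the expression used for the null model is an assumption.

```r
set.seed(1)
n <- 50; p <- 4; g <- n
X <- scale(matrix(rnorm(n * p), n, p), scale = FALSE)  # centred simulated covariates
y <- 1 + 2 * X[, 1] - X[, 3] + rnorm(n)                # only x1 and x3 matter
yc <- y - mean(y)                                      # response minus estimated intercept

# log pi(gamma | y) up to an additive constant (uniform prior on gamma)
logpost <- function(gam) {
  if (sum(gam) == 0) return(-(n - 1) / 2 * log(sum(yc^2)))  # assumed null-model term
  Xg <- X[, gam == 1, drop = FALSE]
  fit <- Xg %*% solve(crossprod(Xg), crossprod(Xg, yc))     # least-squares fit
  kappa <- sum((yc - fit)^2) + sum(fit^2) / (g + 1)
  -sum(gam) / 2 * log(g + 1) - (n - 1) / 2 * log(kappa)
}

# One iteration of the Gibbs sampler: update gamma_1, ..., gamma_p in turn
gibbs_sweep <- function(gam) {
  for (j in seq_along(gam)) {
    g1 <- g0 <- gam
    g1[j] <- 1
    g0[j] <- 0
    # full conditional probability that gamma_j = 1 given the other indicators
    prob1 <- 1 / (1 + exp(logpost(g0) - logpost(g1)))
    gam[j] <- rbinom(1, 1, prob1)
  }
  gam
}

niter <- 2000
gam <- sample(0:1, p, replace = TRUE)   # uniform starting point on the model space
draws <- matrix(0L, niter, p)
for (t in 1:niter) draws[t, ] <- gam <- gibbs_sweep(gam)

# Frequency of inclusion of each variable after a short burn-in
print(colMeans(draws[-(1:200), ]))
```

Each full conditional only requires the unnormalised posterior at the two models that differ in $\gamma_j$, which is why no normalising constant is needed.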


After a large number of iterations of this algorithm (that is, when the
sampler is supposed to have converged or, more accurately, when the sampler
has sufficiently explored the support of the target distribution), its output
can be used to approximate the posterior probabilities $\pi(\gamma \mid y, X)$ by empirical
averages based on the Gibbs output,

\[
\hat P^\pi(\gamma = \gamma^* \mid y) = \frac{1}{T - T_0 + 1} \sum_{t=T_0}^{T} \mathbb{I}_{\gamma^{(t)} = \gamma^*} ,
\]

where the first $T_0$ values are eliminated as burn-in. (The number $T_0$ is therefore
the number of iterations roughly needed to “reach” convergence.) The Gibbs
output can also be used to approximate the probability of inclusion of a given
variable, $P^\pi(\gamma_j = 1 \mid y, X)$, as

\[
\hat P^\pi(\gamma_j = 1 \mid y) = \frac{1}{T - T_0 + 1} \sum_{t=T_0}^{T} \mathbb{I}_{\gamma_j^{(t)} = 1} ,
\]

with the same asymptotic validation.
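
Both empirical averages are straightforward to compute from a matrix of simulated indicators. The sketch below uses a tiny hand-made output (the matrix draws and the value of T0 are hypothetical) so that the results can be checked by hand; it follows the index convention $t = T_0, \ldots, T$ of the formulas above.

```r
# Hypothetical Gibbs output: one row per iteration, one column per indicator
draws <- rbind(c(1, 0, 1),
               c(1, 0, 1),
               c(1, 1, 1),
               c(1, 0, 1),
               c(0, 0, 1),
               c(1, 0, 1))
T0 <- 2                                   # assumed start of the retained iterations
kept <- draws[T0:nrow(draws), , drop = FALSE]

# Posterior model probabilities: relative frequency of each visited model
labels <- apply(kept, 1, paste, collapse = "")
model_prob <- sort(table(labels) / nrow(kept), decreasing = TRUE)

# Posterior inclusion probabilities: column-wise frequency of gamma_j = 1
incl_prob <- colMeans(kept)

print(model_prob)
print(incl_prob)
```

Here model "101" (variables 1 and 3) is visited in three of the five retained iterations, and the second variable is included only once.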

The following R code (again part of the function ModChoBayesReg) describes
our implementation of the above variable-selection Gibbs sampler.
The code uses the null model with only the intercept $\alpha$ as a reference, based
on the integrated likelihood intlike0 as above. It then starts at random in
the collection of models:

gamma=rep(0,niter)
mcur=sample(c(0,1),p,replace=TRUE)
gamma[1]=sum(2^(0:(p-1))*mcur)+1
pcur=sum(mcur)

and computes the corresponding integrated likelihood intlikecur

if (pcur==0) intlikecur=intlike0 else{ #integrated likelihood
  Xcur=X[,which(mcur==1),drop=FALSE]
  Ucur=solve(t(Xcur)%*%Xcur)%*%t(Xcur)
  betatildecur=b1=Ucur%*%X%*%betatilde
  betamlcur=b2=Ucur%*%y
  s2cur=t(y-alphaml-Xcur%*%b2)%*%(y-alphaml-Xcur%*%b2)
  kappacur=as.numeric(s2cur+t(b1-b2)%*%t(Xcur)%*%
    Xcur%*%(b1-b2)/(g+1))
  intlikecur=(g+1)^(-pcur/2)*kappacur^(-(n-1)/2)
}

It then proceeds according to Algorithm 3.5, proposing to change one variable
indicator $\gamma_j$ and accepting this move with a Metropolis-Hastings probability
(defined and justified in Chap. 4):

if (runif(1)<=(intlikeprop/intlikecur))

This modification is more efficient than directly simulating from the
conditional as it avoids proposing the same value for $\gamma_j$ twice.

for (t in 1:(niter-1)){ #iteration index
  mprop=mcur
  j=sample(1:p,1)
  mprop[j]=abs(mcur[j]-1)   # flip the j-th indicator
  pprop=sum(mprop)
  if (pprop==0) intlikeprop=intlike0 else{ #integrated likelihood
    Xprop=X[,which(mprop==1),drop=FALSE]
    Uprop=solve(t(Xprop)%*%Xprop)%*%t(Xprop)
    betatildeprop=b1=Uprop%*%X%*%betatilde
    betamlprop=b2=Uprop%*%y
    s2prop=t(y-alphaml-Xprop%*%betamlprop)%*%
      (y-alphaml-Xprop%*%betamlprop)
    kappaprop=as.numeric(s2prop+t(betatildeprop-betamlprop)%*%
      t(Xprop)%*%Xprop%*%
      (betatildeprop-betamlprop)/(g+1))
    intlikeprop=(g+1)^(-pprop/2)*kappaprop^(-(n-1)/2)
  }
  if (runif(1)<=(intlikeprop/intlikecur)){
    mcur=mprop
    intlikecur=intlikeprop
  }
  gamma[t+1]=sum(2^(0:(p-1))*mcur)+1   # integer label of the current model
}
gamma=gamma[20001:niter] #20,000 burnin steps
res=as.data.frame(table(as.factor(gamma)))
odo=order(res$Freq)[length(res$Freq):(length(res$Freq)-9)]
modcho=res$Var1[odo]
probtop10=res$Freq[odo]/(niter-20000)


In this setting of caterpillar, handling only eight (potential) explanatory
variables means that it is possible to compute all of the $2^8$ probabilities
$\pi(\gamma \mid y)$ and to thus deduce the normalizing constant in (3.7). We can therefore
compare these exact values with the approximations produced by the Gibbs
sampler. Using a burn-in of $T_0 = 20{,}000$ iterations followed by 80,000 further
iterations, i.e., a total of $10^5$ simulations, we obtain the following results
for the top five models:

   Models    PostProb  Gibbs estimate of the PostProb
1  1 2 7     0.0767    0.0740
2  1 7       0.0689    0.0675
3  1 2 3 7   0.0686    0.0668
4  1 3 7     0.0376    0.0376
5  1 2 6     0.0369    0.0370

The comparison is quite comforting for the Gibbs sampler as the differences
are truly minor! Rather naturally, as the number of variables grows, the number
of simulations needed to provide a good approximation grows as well. Once
more, we recommend running the code several times (with different random
sequences) to ensure the stability of the approximation.


3.7  Exercises 


3.1 Show that the matrix $Z$ is of full rank if and only if the matrix $Z^TZ$ is
invertible (where $Z^T$ denotes the transpose of the matrix $Z$, which can be produced
in R using the t(Z) command). Apply to $Z = [1_n \; X]$ and deduce that this
cannot happen when $p + 1 > n$.


3.2 Show that solving the minimization program

\[ \min_\beta \; (y - X\beta)^T (y - X\beta) \]

requires solving the system of equations $(X^TX)\beta = X^Ty$. Check that this can
be done via the R command solve(t(X)%*%(X),t(X)%*%y).

3.3 Show that the variance of the maximum likelihood estimator of $\beta$ in the
regression model is given by $V(\hat\beta \mid \sigma^2) = \sigma^2 (X^TX)^{-1}$.

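3.4 For the model

\[ y \mid \beta, \sigma^2 \sim \mathcal{N}_n(X\beta, \sigma^2 I_n) \]

a conjugate prior distribution is as follows: the conditional distribution of $\beta$ is
given by

\[ \beta \mid \sigma^2 \sim \mathcal{N}_p(\tilde\beta, \sigma^2 M^{-1}) , \]

where $M$ is a $(p,p)$ positive definite symmetric matrix, and the marginal prior on
$\sigma^2$ is an inverse Gamma distribution

\[ \sigma^2 \sim \mathcal{IG}(a, b), \quad a, b > 0 . \]

Taking advantage of the matrix identities

\[
(M + X^TX)^{-1} = M^{-1} - M^{-1} \left(M^{-1} + (X^TX)^{-1}\right)^{-1} M^{-1}
= (X^TX)^{-1} - (X^TX)^{-1} \left(M^{-1} + (X^TX)^{-1}\right)^{-1} (X^TX)^{-1}
\]

and

\[
X^TX (M + X^TX)^{-1} M = \left( M^{-1} (M + X^TX) (X^TX)^{-1} \right)^{-1}
= \left( M^{-1} + (X^TX)^{-1} \right)^{-1} ,
\]

establish that

\[
\beta \mid y, \sigma^2 \sim \mathcal{N}_p\left( (M + X^TX)^{-1} \{ (X^TX)\hat\beta + M\tilde\beta \},\; \sigma^2 (M + X^TX)^{-1} \right)
\tag{3.8}
\]

where $\hat\beta = (X^TX)^{-1} X^T y$ and

\[
\sigma^2 \mid y \sim \mathcal{IG}\left( \frac{n}{2} + a,\; b + \frac{s^2}{2} + \frac{(\tilde\beta - \hat\beta)^T \left(M^{-1} + (X^TX)^{-1}\right)^{-1} (\tilde\beta - \hat\beta)}{2} \right)
\tag{3.9}
\]

where $s^2 = (y - X\hat\beta)^T (y - X\hat\beta)$ are the correct posterior distributions. Give a
$(1 - \alpha)$ HPD region on $\beta$.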


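3.5 The regression model of Exercise 3.4 can also be used in a predictive sense:
for a given $(m, p+1)$ explanatory matrix $\tilde X$, i.e., when predicting $m$ unobserved
variates $\tilde y_i$, the corresponding outcome $\tilde y$ can be inferred through the predictive
distribution $\pi(\tilde y \mid \sigma^2, y)$. Show that $\pi(\tilde y \mid \sigma^2, y)$ is a Gaussian density with mean

\[ E^\pi[\tilde y \mid \sigma^2, y] = \tilde X (M + X^TX)^{-1} (X^TX\hat\beta + M\tilde\beta) \]

and covariance matrix

\[ V^\pi(\tilde y \mid \sigma^2, y) = \sigma^2 \left( I_m + \tilde X (M + X^TX)^{-1} \tilde X^T \right) . \]

Deduce that

\[
\tilde y \mid y \sim \mathcal{T}_m\Big( n + 2a,\; \tilde X (M + X^TX)^{-1} (X^TX\hat\beta + M\tilde\beta),\;
\frac{2b + s^2 + (\tilde\beta - \hat\beta)^T \left(M^{-1} + (X^TX)^{-1}\right)^{-1} (\tilde\beta - \hat\beta)}{n + 2a}
\times \left\{ I_m + \tilde X (M + X^TX)^{-1} \tilde X^T \right\} \Big) .
\]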


3.6 Show that the marginal distribution of $y$ associated with (3.8) and (3.9) is
given by

\[ y \sim \mathcal{T}_n\left( 2a,\; X\tilde\beta,\; \frac{b}{a} \left( I_n + X M^{-1} X^T \right) \right) . \]

3.7 Show that the matrix $(I_n + gX(X^TX)^{-1}X^T)$ has 1 and $g + 1$ as only
eigenvalues. (Hint: Show that the eigenvectors associated with $g + 1$ are of the
form $X\beta$ and that the eigenvectors associated with 1 are those orthogonal to
$X$.) Deduce that the determinant of the matrix $(I_n + gX(X^TX)^{-1}X^T)$ is indeed
$(g + 1)^{p+1}$.

3.8 Under the Jeffreys prior, give the predictive distribution of $\tilde y$, the
$m$-dimensional vector corresponding to the $(m,p)$ matrix of explanatory variables $\tilde X$.

3.9 If $(x_1, x_2)$ is distributed from the uniform distribution on

\[ \{(x_1,x_2);\ (x_1 - 1)^2 + (x_2 - 1)^2 \le 1\} \cup \{(x_1,x_2);\ (x_1 + 1)^2 + (x_2 + 1)^2 \le 1\} , \]

show that the Gibbs sampler does not produce an irreducible chain. For this
distribution, find an alternative Gibbs sampler that works. (Hint: Consider a rotation
of the coordinate axes.)

3.10 If a joint density $g(y_1, y_2)$ corresponds to the conditional distributions
$g_1(y_1 \mid y_2)$ and $g_2(y_2 \mid y_1)$, show that it is given by

\[ g(y_1, y_2) = \frac{g_2(y_2 \mid y_1)}{\int g_2(v \mid y_1) / g_1(y_1 \mid v) \, dv} . \]


3.11 Considering the model

\[ \eta \mid \theta \sim \mathcal{B}in(n, \theta) , \qquad \theta \sim \mathcal{B}e(a, b) , \]

derive the joint distribution of $(\eta, \theta)$ and the corresponding full conditional
distributions. Implement a Gibbs sampler associated with those full conditionals and
compare the outcome of the Gibbs sampler on $\theta$ with the true marginal distribution
of $\theta$.

3.12 Take the posterior distribution on $(\theta, \sigma^2)$ associated with the joint model

\[ x_i \mid \theta, \sigma^2 \sim \mathcal{N}(\theta, \sigma^2), \quad i = 1, \ldots, n, \]
\[ \theta \sim \mathcal{N}(\theta_0, \tau^2), \qquad \sigma^2 \sim \mathcal{IG}(a, b) . \]

Show that the full conditional distributions are given by

\[
\theta \mid x, \sigma^2 \sim \mathcal{N}\left( \frac{\sigma^2}{\sigma^2 + n\tau^2}\, \theta_0 + \frac{n\tau^2}{\sigma^2 + n\tau^2}\, \bar x,\; \frac{\sigma^2 \tau^2}{\sigma^2 + n\tau^2} \right)
\]

and

\[
\sigma^2 \mid x, \theta \sim \mathcal{IG}\left( \frac{n}{2} + a,\; \frac{1}{2} \sum_i (x_i - \theta)^2 + b \right) ,
\]

where $\bar x$ is the empirical average of the observations. Implement the Gibbs sampler
associated with these conditionals.

This was the sort of thing that impressed
Rebus: not nature, but ingenuity.

Ian  Rankin,  A  Question  of  Blood. — 


Roadmap 

Generalized linear models are extensions of the linear regression model described
in the previous chapter. In particular, they avoid the selection of a single
transformation of the data that must achieve the possibly conflicting goals of normality
and linearity imposed by the linear regression model, which is for instance
impossible for binary or count responses. The trick that allows both a feasible processing
and an extension of linear regression is first to turn the covariates into a real
number by a linear projection and then to transform this value so that it fits the
support of the response. We focus here on the Bayesian analysis of probit and
logit models for binary data and of log-linear models for contingency tables.

On the methodological side, we present a general MCMC method, the
Metropolis-Hastings algorithm, which is used for the simulation of complex
distributions where both regular and Gibbs sampling fail. This includes in particular
the random walk Metropolis-Hastings algorithm, which acts like a plain vanilla
MCMC algorithm.


J.-M. Marin and C.P. Robert, Bayesian Essentials with R, Springer Texts
in Statistics, DOI 10.1007/978-1-4614-8687-9_4,

© Springer Science+Business Media New York 2014

4.1 A Generalization of the Linear Model

4.1.1  Motivation 

In the previous chapter, we modeled the connection between a response
variable $y$ and a vector $x$ of explanatory variables by a linear dependence relation
with normal perturbations. There are many instances where both the linearity
and the normality assumptions are not appropriate, especially when the
support of $y$ is restricted to $\mathbb{R}_+$ or $\mathbb{N}$. For instance, in dichotomous models,
$y$ takes its values in $\{0, 1\}$ as it represents the indicator of occurrence of a
particular event (death in a medical study, unemployment in a socioeconomic
study, migration in a capture-recapture study, etc.); in this case, a linear
conditional expectation $E[y \mid x, \beta] = x^T\beta$ would be fairly cumbersome to handle,
both in terms of the constraints on $\beta$ and the corresponding distribution of
the error $\varepsilon = y - E[y \mid x, \beta]$.


[Figure: two panels with axes labelled "Bottom margin width (mm)" and "Status"]

Fig. 4.1. Dataset bank: (left) Plot of the status indicator versus the bottom margin
width; (right) boxplots of the bottom margin width for both counterfeit statuses


The bank dataset we analyze in the first part of this chapter comes from
Flury and Riedwyl (1988) and is made of four measurements on 100 genuine
Swiss banknotes and 100 counterfeit ones. The response variable $y$ is thus the
status of the banknote, where 0 stands for genuine and 1 stands for counterfeit,
while the explanatory factors are the length of the bill $x_1$, the width of the
left edge $x_2$, the width of the right edge $x_3$, and the bottom margin width $x_4$,
all expressed in millimeters. We want a probabilistic model that predicts the
type of banknote (i.e., that detects counterfeit banknotes) based on the four
measurements above. To motivate the introduction of the generalized linear
models, we only consider here the dependence of $y$ on the fourth measure,
$x_4$, which again is the bottom margin width of the banknote. To start, the
$y_i$'s being binary, the conditional distribution of $y$ given $x_4$ cannot be normal.
Nonetheless, as shown by Fig. 4.1, the variable $x_4$ clearly has a strong influence
on whether the banknote is or is not counterfeit. To model this dependence in
a proper manner, we must devise a realistic (if not real!) connection between
$y$ and $x_4$. The fact that $y$ is binary implies a specific form of dependence:
indeed, both its marginal and conditional distributions necessarily are Bernoulli
distributions. This means that, for instance, the conditional distribution of $y$
given $x_4$ is a Bernoulli $\mathcal{B}(p(x_4))$ distribution; that is, for $x_4 = x_{4i}$, there exists
$0 \le p_i = p(x_{4i}) \le 1$ such that

\[ P(y_i = 1 \mid x_4 = x_{4i}) = p_i , \]

which turns out to be also the conditional expectation of $y_i$, $E[y_i \mid x_{4i}]$. If we
do impose a linear dependence on the $p_i$'s, namely,

\[ p(x_{4i}) = \beta_0 + \beta_1 x_{4i} , \]

the maximum likelihood estimates of $\beta_0$ and $\beta_1$ are then equal to $-2.02$ and
$0.268$, leading to the estimated prediction equation

\[ \hat p_i = -2.02 + 0.268\, x_{4i} . \tag{4.1} \]

This implies that a banknote with bottom margin width equal to 8 is
counterfeit with probability

\[ -2.02 + 0.268 \times 8 = 0.120 . \]

Thus, this banknote has a relatively small probability of having been
counterfeited, which coincides with the intuition drawn from Fig. 4.1. However, if we
now consider a banknote with bottom margin width equal to 12, (4.1) implies
that this banknote is counterfeited with probability

\[ -2.02 + 0.268 \times 12 = 1.192 , \]

which is certainly embarrassing for a probability estimate! We could
try to modify the result by truncating the probability to $(0, 1)$ and by deciding
that this value of $x_4$ almost certainly indicates a counterfeit, but still there
is a fundamental difficulty with this model. The fact that an ordinary linear
dependence can predict values outside $(0,1)$ suggests that the connection between
this explanatory variable and the probability of a counterfeit cannot be
modeled through a linear function but rather can be achieved using functions
of $x_{4i}$ that take their values within the interval $(0, 1)$.
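
The difficulty can be reproduced in a few lines of R. The sketch below plugs the rounded coefficients of (4.1) into the linear prediction equation; since the coefficients are rounded, the third decimal may differ slightly from the values printed above. The function name linear_prob and the grid of widths are ours.

```r
# Rounded estimates quoted in (4.1)
beta0 <- -2.02
beta1 <- 0.268
linear_prob <- function(x4) beta0 + beta1 * x4

widths <- c(8, 10, 12)                 # bottom margin widths (mm)
p_lin <- linear_prob(widths)
valid <- p_lin > 0 & p_lin < 1         # is each value a genuine probability?
print(data.frame(width = widths, p_lin = p_lin, valid = valid))

# Two functions that map any real number into (0,1):
# the logistic cdf and the standard normal cdf
u <- c(-5, 0, 5)
print(rbind(logistic = plogis(u), normal = pnorm(u)))
```

The last two lines only illustrate that transforms with values in $(0,1)$ are readily available; they do not reuse the coefficients of the linear fit.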


4.1.2  Link  Functions 


As shown by the previous analysis, while linear models are nice to work with,
they also have strong limitations. Therefore, we need a broader class of models
to cover various dependence structures. The class selected for this chapter is
called the family of generalized linear models (GLM), which has been
formalized in McCullagh and Nelder (1989). This nomenclature stems from the fact
that the dependence of $y$ on $x$ is partly linear in the sense that the conditional
distribution of $y$ given $x$ is defined in terms of a linear combination $x^T\beta$ of
the components of $x$,

\[ y \mid \beta \sim f(y \mid x^T\beta) . \]

As in the previous chapter, we use the notation $y = (y_1, \ldots, y_n)$ for a
sample of $n$ responses and

\[
X = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1k} \\
x_{21} & x_{22} & \cdots & x_{2k} \\
x_{31} & x_{32} & \cdots & x_{3k} \\
\vdots & \vdots &        & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nk}
\end{bmatrix}
\]

for the $n \times k$ matrix of corresponding explanatory variables, possibly with
$x_{11} = \ldots = x_{n1} = 1$. We use $y$ and $x$ as generic notations for single-response
and covariate vectors, respectively. Once again, we will omit the dependence
on $x$ or $X$ to simplify notations.

A generalized linear model is specified by two functions:


1. a conditional density $f$ of $y$ given $x$ that belongs to an exponential family
   (Sect. 2.2.3) and that is parameterized by an expectation parameter
   $\mu = \mu(x) = E[y \mid x]$ and possibly a dispersion parameter $\varphi > 0$ that does not
   depend on $x$; and

2. a link function $g$ that relates the mean $\mu = \mu(x)$ of $f$ and the covariate
   vector, $x$, as $g(\mu) = x^T\beta$, $\beta \in \mathbb{R}^k$.


For identifiability reasons, the link function $g$ is a one-to-one function and we
have

\[ E[y \mid \beta, \varphi] = g^{-1}(x^T\beta) . \]

We can thus write the (conditional) likelihood as

\[ \ell(\beta, \varphi \mid y) = \prod_{i=1}^n f\left(y_i \mid x^{iT}\beta, \varphi\right) \]

if we choose to reparameterize $f$ with the transform $g(\mu_i)$ of its mean and if
we denote by $x^i$ the covariate vector for the $i$th observation.¹

The ordinary linear regression is obviously a special case of GLM where
$g(x) = x$, $\varphi = \sigma^2$ and $y \mid \beta, \sigma^2 \sim \mathcal{N}(x^T\beta, \sigma^2)$. However, outside the linear
model, the interpretation of the coefficients $\beta_i$ is much more delicate because
these coefficients do not relate directly to the observables, due to the presence
of a link function that cannot be the identity. For instance, in the logistic
regression model (defined in the following paragraph), the linear dependence
is defined in terms of the log-odds ratio $\log\{p_i/(1 - p_i)\}$.

The most widely used GLMs are presumably those that analyze binary
data, as in bank, that is, when $y_i \sim \mathcal{B}(1, p_i)$ (with $\mu_i = p_i = p(x^{iT}\beta)$). The
mean function $p$ thus transforms a real value into a value between 0 and 1,
and a possible choice of link function is the logit transform,

\[ g(p) = \log\{p/(1 - p)\} , \]

associated with the logistic regression model. Because of the limited support
of the responses $y_i$, there is no dispersion parameter in this model and the
corresponding likelihood function is

\[
\ell(\beta \mid y) = \prod_{i=1}^n
\left( \frac{\exp(x^{iT}\beta)}{1 + \exp(x^{iT}\beta)} \right)^{y_i}
\left( \frac{1}{1 + \exp(x^{iT}\beta)} \right)^{1 - y_i} .
\tag{4.2}
\]

It thus fails to factorize conveniently because of the denominator: there is no
manageable conjugate prior for this model, called the logit model.
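
Although it has no conjugate prior, this likelihood is cheap to evaluate, which is all that simulation methods require. Here is a minimal sketch of its logarithm, $\sum_i \{y_i\, x^{iT}\beta - \log(1 + \exp(x^{iT}\beta))\}$, checked against R's built-in Bernoulli density. The data are simulated (not the bank dataset) and the function name logit_loglike is ours.

```r
# Log-likelihood of the logit model for a 0/1 response vector y
# and an n x k covariate matrix X
logit_loglike <- function(beta, y, X) {
  eta <- as.vector(X %*% beta)        # linear predictors x^{iT} beta
  sum(y * eta - log1p(exp(eta)))      # log of the product in the likelihood
}

# Simulated data under hypothetical coefficients
set.seed(42)
n <- 200
X <- cbind(1, rnorm(n))
beta_true <- c(-0.5, 1.5)
y <- rbinom(n, 1, plogis(as.vector(X %*% beta_true)))

# Sanity check against the Bernoulli density evaluated directly
ll_ours <- logit_loglike(beta_true, y, X)
ll_direct <- sum(dbinom(y, 1, plogis(as.vector(X %*% beta_true)), log = TRUE))
print(c(ours = ll_ours, direct = ll_direct))
```

Working on the log scale avoids numerical underflow when the number of observations is large.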

There exists a specific form of link function for each exponential family
which is called the canonical link. This canonical function is chosen as the
function $g^*$ of the expectation parameter that appears in the exponent of the
natural exponential family representation of the probability density, namely

\[ g^*(\mu) = \theta \quad \text{if} \quad f(y \mid \mu, \varphi) = h(y) \exp \varphi\{T(y) \cdot \theta - \Psi(\theta)\} . \]

Since the logistic regression model can be written as

\[ f(y_i \mid p_i) = \exp\left\{ y_i \log\left( \frac{p_i}{1 - p_i} \right) + \log(1 - p_i) \right\} , \]

the logit link function is the canonical version for the Bernoulli model. Note
that, while it is customary to use the canonical link, there is no compelling
reason to do so, besides following custom!


¹ This upper indexing allows for the distinction between $x_i$, the $i$th component
of the covariate vector, and $x^i$, the $i$th vector of covariates in the sample.

For binary response variables, many link functions can be substituted for
the logit link function. For instance, the probit link function, $g(\mu_i) = \Phi^{-1}(\mu_i)$,
where $\Phi$ is the standard normal cdf, is often used in econometrics. The
corresponding likelihood is

\[
\ell(\beta \mid y) \propto \prod_{i=1}^n \Phi(x^{iT}\beta)^{y_i} \left[ 1 - \Phi(x^{iT}\beta) \right]^{1 - y_i} .
\tag{4.3}
\]

Although this alternative is also quite arbitrary and any other cdf could be
used to define a link function (such as the logistic cdf associated with (4.2)), the
probit link function enjoys a missing-data (Chap. 6) interpretation that clearly
boosted its popularity: this model can indeed be interpreted as a degraded
linear regression model in the sense that observing $y_i = 1$ corresponds to
the case $z_i > 0$, where $z_i$ is a latent (that is, unobserved) variable such that
$z_i \sim \mathcal{N}(x^{iT}\beta, 1)$. In other words, $y_i = \mathbb{I}(z_i > 0)$ appears as a dichotomized
linear regression response. Of course, this perspective is only an interpretation
of the probit model in the sense that there may be no hidden $z_i$'s at all in the
real world! In addition, the probit and logistic regression models have quite
similar behaviors, differing mostly in the tails.
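
Both remarks can be checked numerically. The sketch below simulates the latent variables $z_i$ under hypothetical coefficients, dichotomizes them, and compares the frequency of $y_i = 1$ with the average probit probability; it then compares the normal cdf with a rescaled logistic cdf. The rescaling constant 1.702 is a commonly quoted approximation and is not taken from this chapter.

```r
set.seed(7)
n <- 1e5
X <- cbind(1, runif(n, -1, 1))
beta <- c(0.3, -1.2)                   # hypothetical coefficients
eta <- as.vector(X %*% beta)

z <- rnorm(n, mean = eta, sd = 1)      # latent linear regression response
y <- as.integer(z > 0)                 # observed dichotomized response

# Frequency of y = 1 versus the average of Phi(x^{iT} beta)
print(c(empirical = mean(y), probit = mean(pnorm(eta))))

# Largest gap between the normal cdf and a rescaled logistic cdf on a grid
u <- seq(-4, 4, by = 0.01)
max_gap <- max(abs(pnorm(u) - plogis(1.702 * u)))
print(max_gap)
```

After rescaling, the two cdfs differ by less than 0.01 everywhere on this grid, so the choice between probit and logit links rarely matters much for fitted probabilities away from the tails.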

Another type of GLM deals with unbounded integer-valued variables. The
Poisson regression model starts from the assumption that the $y_i$'s are Poisson
$\mathcal{P}(\mu_i)$ and it selects a link function connecting $\mathbb{R}_+$ bijectively with $\mathbb{R}$, such
as, for instance, the logarithmic function, $g(\mu_i) = \log(\mu_i)$. This model is thus
a count model in the sense that the responses are integers, for instance the
number of deaths due to lung cancer in a county or the number of speeding
tickets issued on a particular stretch of highway, and it is quite common in
epidemiology. The corresponding likelihood is

\[
\ell(\beta \mid y) = \prod_{i=1}^n \left( \frac{1}{y_i!} \right) \exp\left\{ y_i\, x^{iT}\beta - \exp(x^{iT}\beta) \right\} ,
\]

where the factorial terms $(1/y_i!)$ are irrelevant for both likelihood and
posterior computations. Note that it does not factorize conveniently because of the
exponential terms within the exponential.
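
As with the logit model, the Poisson log-likelihood is simple to code. The sketch below (simulated data, hypothetical coefficients, function name pois_loglike ours) checks the formula against R's Poisson density and illustrates why the factorial terms can be dropped: they cancel in any difference of log-likelihoods.

```r
# Log-likelihood of the Poisson regression model with logarithmic link;
# keep_factorial = FALSE drops the terms log(1/y_i!)
pois_loglike <- function(beta, y, X, keep_factorial = TRUE) {
  eta <- as.vector(X %*% beta)
  ll <- sum(y * eta - exp(eta))
  if (keep_factorial) ll <- ll - sum(lgamma(y + 1))
  ll
}

set.seed(5)
n <- 100
X <- cbind(1, rnorm(n))
beta_a <- c(0.2, 0.5)                  # hypothetical coefficients
beta_b <- c(0.0, 0.3)
y <- rpois(n, exp(as.vector(X %*% beta_a)))

# Check against the Poisson density evaluated directly
ll_ours <- pois_loglike(beta_a, y, X)
ll_direct <- sum(dpois(y, exp(as.vector(X %*% beta_a)), log = TRUE))

# Log-likelihood differences are unchanged when the factorials are dropped
diff_full <- pois_loglike(beta_a, y, X) - pois_loglike(beta_b, y, X)
diff_kernel <- pois_loglike(beta_a, y, X, FALSE) - pois_loglike(beta_b, y, X, FALSE)
print(c(ours = ll_ours, direct = ll_direct, diff_full = diff_full, diff_kernel = diff_kernel))
```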

The  three  examples  above  are  simply  illustrations  of  the  versatility  of 
generalized  linear  modeling.  In  this  chapter,  we  discuss  only  two  types  of 
data  for  which  generalized  linear  modeling  is  appropriate.  We  refer  the  reader 
to  McCullagh  and  Nelder  (1989)  and  Gelman  et  al.  (2013)  for  a  much  more 
detailed  coverage. 


4.2 Metropolis-Hastings Algorithms

As partly hinted by the previous examples, posterior inference in GLMs is
much harder than in linear models because of less manageable (and
non-factorizing) likelihoods, which explains the longevity and versatility of linear
model studies over the past centuries! Working with a GLM typically requires
specific numerical or simulation tools. We take the opportunity of this
requirement to introduce a universal MCMC method called the Metropolis-Hastings
algorithm. Its range of applicability is incredibly broad (meaning that it is by
no means restricted to GLM applications) and its inclusion in the Bayesian
toolbox in the early 1990s has led to considerable extensions of the Bayesian
field.²


4.2.1  Definition 

When compared with the Gibbs sampler, Metropolis-Hastings algorithms are
generic (or off-the-shelf) MCMC algorithms in the sense that they can be
tuned toward a much wider range of possibilities. Those algorithms are also
a natural extension of standard simulation algorithms such as accept-reject
(see Chap. 5) or sampling importance resampling methods since they are all
based on a proposal distribution. However, a major difference is that, for the
Metropolis-Hastings algorithms, the proposal distribution is Markov, with
kernel density $q(x, y)$. If the target distribution has density $\pi$, the
Metropolis-Hastings algorithm is as follows:


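Algorithm 4.6 Generic Metropolis-Hastings Sampler

Initialization: Choose an arbitrary starting value $x^{(0)}$.

Iteration $t$ ($t \ge 1$):

1. Given $x^{(t-1)}$, generate $\tilde x \sim q(x^{(t-1)}, x)$.

2. Compute

\[
\rho(x^{(t-1)}, \tilde x) = \min\left( \frac{\pi(\tilde x) / q(x^{(t-1)}, \tilde x)}{\pi(x^{(t-1)}) / q(\tilde x, x^{(t-1)})} ,\, 1 \right) .
\]

3. With probability $\rho(x^{(t-1)}, \tilde x)$, accept $\tilde x$ and set $x^{(t)} = \tilde x$;
   otherwise reject $\tilde x$ and set $x^{(t)} = x^{(t-1)}$.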
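
The three steps translate almost line by line into R. The following is a sketch rather than a reference implementation: the function mh_sampler, its argument names, the Gamma target and the log-normal proposal are all our choices for illustration. The proposal is deliberately asymmetric so that the ratio of proposal densities matters in step 2.

```r
# Generic Metropolis-Hastings sampler for a scalar target (sketch).
#   logpi  : log target density, up to an additive constant
#   rprop  : function(x) drawing a proposal from q(x, .)
#   ldprop : function(x, y) returning log q(x, y)
mh_sampler <- function(logpi, rprop, ldprop, x0, niter) {
  x <- numeric(niter + 1)
  x[1] <- x0
  for (t in 1:niter) {
    cur <- x[t]
    prop <- rprop(cur)                                  # step 1
    logrho <- (logpi(prop) - ldprop(cur, prop)) -       # step 2, on the log scale
              (logpi(cur) - ldprop(prop, cur))
    x[t + 1] <- if (log(runif(1)) <= logrho) prop else cur   # step 3
  }
  x
}

# Illustration: Gamma(3, 1) target known only up to a constant,
# with a multiplicative (log-normal) random move as proposal
logpi <- function(x) 2 * log(x) - x
rprop <- function(x) x * exp(rnorm(1, 0, 0.8))
ldprop <- function(x, y) dlnorm(y, meanlog = log(x), sdlog = 0.8, log = TRUE)

set.seed(123)
chain <- mh_sampler(logpi, rprop, ldprop, x0 = 1, niter = 20000)
print(mean(chain[-(1:1000)]))   # should be close to the Gamma(3, 1) mean, 3
```

Computing the acceptance ratio on the log scale is a design choice that avoids overflow when the target takes very small or very large values.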


The distribution $q$ is also called the instrumental distribution. As in the
accept-reject method (Sect. 5.4), we only need to know either $\pi$ or $q$ up to a
proportionality constant since both constants cancel in the calculation of $\rho$.
Note also the advantage of this approach compared with the Gibbs sampler:
it is not necessary to use the conditional distributions of $\pi$.

The strong appeal of this algorithm is that it is rather universal in its
formulation as well as in its use. Indeed, we only need to simulate from a
proposal $q$ that can be chosen quite freely. There is, however, a theoretical
constraint, namely that the chain produced by this algorithm must be able to
explore the support of $\pi$ in a finite number of steps. As discussed below,
there are also many practical difficulties, to the point that the algorithm may
lose its universal feature and require some specific tuning for each
new application.

² This algorithm had been used by particle physicists, including Metropolis, since
the late 1940s, but, as is often the case, the connection with statistics was not made
until much later!

The theoretical validation of this algorithm is the same as with other
MCMC algorithms: the target distribution $\pi$ is the limiting distribution of
the Markov chain produced by Algorithm 4.6. This is due to the choice of the
acceptance probability $\rho(x, y)$ since the so-called detailed balance equation

\[ \pi(x)\, q(x, y)\, \rho(x, y) = \pi(y)\, q(y, x)\, \rho(y, x) \]

holds and thus implies that $\pi$ is stationary by integrating out $x$.
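
On a finite state space, both the detailed balance equation and the resulting stationarity can be verified numerically. The sketch below builds the Metropolis-Hastings transition matrix for an arbitrary four-state target and an arbitrary proposal matrix; all numerical values and variable names are hypothetical.

```r
pi_target <- c(0.1, 0.2, 0.3, 0.4)     # target distribution on 4 states
K <- length(pi_target)

set.seed(3)
Q <- matrix(runif(K * K), K, K)
Q <- Q / rowSums(Q)                    # arbitrary proposal kernel q(x, y)

# Metropolis-Hastings transition matrix: move x -> y with probability q * rho
P <- matrix(0, K, K)
for (x in 1:K) for (y in 1:K) if (x != y) {
  rho <- min(1, (pi_target[y] / Q[x, y]) / (pi_target[x] / Q[y, x]))
  P[x, y] <- Q[x, y] * rho
}
diag(P) <- 1 - rowSums(P)              # rejected moves stay at the current state

# Detailed balance: flow[x, y] = pi(x) P(x, y) must be a symmetric matrix
flow <- pi_target * P
balance_gap <- max(abs(flow - t(flow)))

# Stationarity: summing detailed balance over x gives pi P = pi
stationary_gap <- max(abs(as.vector(pi_target %*% P) - pi_target))
print(c(balance = balance_gap, stationary = stationary_gap))
```

Both gaps are zero up to floating-point error, whatever the proposal matrix, which is the discrete counterpart of the argument above.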

While the theoretical guarantees that the algorithm converges are very strong,
the choice of $q$ remains essential in practice. Poor choices of $q$ may indeed
result either in a very high rejection rate, meaning that the Markov chain
$(x^{(t)})_t$ hardly moves, or in a myopic exploration of the support of $\pi$, that
is, in a dependence on the starting value $x^{(0)}$ such that the chain is stuck
in a neighborhood region of $x^{(0)}$. A particular choice of proposal $q$ may thus
work well for one target density but be extremely poor for another one. While
the algorithm is indeed universal, it is impossible to prescribe
application-independent strategies for choosing $q$.

We  thus  consider  below  two  specific  cases  of  proposals  and  briefly  discuss 
their  pros  and  cons  (see  Robert  and  Casella,  2004,  Chap.  7,  for  a  detailed 
discussion). 


4.2.2  The  Independence  Sampler 

The  choice  of  q  closest  to  the  accept-reject  method  (see  Algorithm  5.9)  is  to 
pick  a  constant  q  that  is  independent  of  its  first  argument, 

$$q(x, y) = q(y)\,.$$

In that case, $\rho$ simplifies into

$$\rho(x, y) = \min\left\{1, \frac{\pi(y)/q(y)}{\pi(x)/q(x)}\right\}.$$


In the special case in which $q$ is proportional to $\pi$, we obtain $\rho(x, y) = 1$ and
the algorithm reduces, as expected, to iid sampling from $\pi$. The analogy with
the accept-reject algorithm is that the maximum of the ratio $\pi/q$ is replaced
with the current value $\pi(x^{(t-1)})/q(x^{(t-1)})$ but the sequence of accepted $x^{(t)}$'s
is not iid because of the acceptance step.
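As a minimal sketch of the independence sampler, assume a $\mathcal{N}(0,1)$ target and a Student $t_3$ proposal. Both choices are ours, made so that the ratio $\pi/q$ stays bounded, and the function name indep_mh is hypothetical:

```r
# Independence Metropolis-Hastings sketch (illustrative assumptions:
# N(0,1) target, Student t_3 proposal; names are hypothetical)
indep_mh <- function(n, x0, df = 3) {
  x <- numeric(n)
  x[1] <- x0
  # log importance ratio log(pi(z)/q(z)), up to an additive constant
  lw <- function(z) dnorm(z, log = TRUE) - dt(z, df = df, log = TRUE)
  for (i in 2:n) {
    y <- rt(1, df = df)                 # the proposal ignores x[i-1]
    if (log(runif(1)) <= lw(y) - lw(x[i - 1])) x[i] <- y
    else x[i] <- x[i - 1]
  }
  x
}
set.seed(1)
out <- indep_mh(5000, 0)
```

The acceptance step compares the importance ratios at the proposed and current values, so a current value with an unusually large ratio tends to be kept for several iterations.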

The convergence properties of the algorithm depend on the density $q$.
First, $q$ needs to be positive everywhere on the support of $\pi$. Second, for good
exploration of this support, it appears that the ratio $\pi/q$ needs to be bounded
(see Robert and Casella, 2004, Theorem 7.8). Otherwise, the chain may take
too long to reach some regions with low $q/\pi$ values. This constraint obviously
reduces the appeal of using an independence sampler, even though the fact
that it does not require an explicit upper bound on $\pi/q$ may sometimes be a
plus.

This  type  of  MH  sampler  is  thus  very  model-dependent,  and  it  suffers  from 
the  same  drawbacks  as  the  importance  sampling  methodology,  namely  that 
tuning  the  “right”  proposal  becomes  much  harder  as  the  dimension  increases. 

4.2.3  The  Random  Walk  Sampler 

Since the independence sampler requires too much global information about
the target distribution that is difficult to come by in complex or
high-dimensional problems, an alternative is to opt for a local gathering of
information, clutching to the hope that the accumulated information will provide,
in the end, the global picture. Practically, this means exploring the
neighborhood of the current value in search of other points of interest. The
simplest exploration device is based on random walk dynamics.

A random walk proposal is based on a symmetric transition kernel $q(x, y) =
q_{RW}(y - x)$ with $q_{RW}(x) = q_{RW}(-x)$. Symmetry implies that the acceptance
probability $\rho(x, y)$ reduces to the simpler form

$$\rho(x, y) = \min\left(1, \pi(y)/\pi(x)\right).$$

The appeal of this scheme is obvious when looking at the acceptance
probability, since it only depends on the target $\pi$ and since this version accepts all
proposed moves that increase the value of $\pi$. There is considerable flexibility
in the choice of the distribution $q_{RW}$, at least in terms of scale (i.e., the size
of the neighborhood of the current value) and tails. Note that while from a
probabilistic point of view random walks usually have no stationary
distribution, the algorithm biases the random walk by moving toward modes of $\pi$
more often than moving away from them.

The ambivalence of MCMC methods like the Metropolis-Hastings
algorithm is that they can be applied to virtually any target. This is a terrific plus
in that they can tackle new models, but there is also a genuine danger that
they simultaneously fail to converge and fail to signal that they have failed to
converge! Indeed, these algorithms can produce seemingly reasonable results,
with all outer aspects of stability, while they are missing major modes of the
target distribution. For instance, particular attention must be paid to models
where the number of parameters exceeds by far the size of the dataset.

4.2.4  Output  Analysis  and  Proposal  Design 

An important problem with the implementation of an MCMC algorithm is to
gauge when convergence has been achieved; that is, to assess at what point
the distribution of the chain is sufficiently close to its asymptotic distribution
for all practical purposes or, more practically, when it has covered the whole
support of the target distribution with sufficient regularity. The number of
iterations $T_0$ that is required to achieve this goal is called the burn-in period.
It is usually sensible to discard simulated values within this burn-in period
in the Monte Carlo estimation so that the bias caused by the starting value
is reduced. However, and this is particularly true in high dimensions, the
empirical assessment of MCMC convergence is extremely delicate, to the point
that it is rarely possible to be certain that an algorithm has converged.3
Nevertheless, some partial convergence diagnostic procedures can be found
in the literature (see Robert and Casella, 2004, Chap. 12, and Robert and
Casella, 2009, Chap. 8). In particular, the latter describes the R package coda
in Sect. 8.2.4.

A  first  way  to  assess  whether  or  not  a  chain  is  in  its  stationary  regime 
is  to  visually  compare  trace  plots  of  sequences  started  at  different  values,  as 
it  may  expose  difficulties  related,  for  instance,  to  multimodality.  In  practice, 
when  chains  of  length  T  from  two  starting  values  have  visited  substantially 
different  parts  of  the  state  space,  the  burn-in  period  for  at  least  one  of  the 
chains  should  be  greater  than  T.  Note,  however,  that  the  problem  of  obtaining 
overdispersed  starting  values  can  be  difficult  when  little  is  known  about  the 
target  density,  especially  in  large  dimensions. 
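The comparison of chains started at different values can be sketched as follows, assuming a $\mathcal{N}(0,1)$ target, two deliberately remote starting values, and a deliberately small random walk scale (all of these are illustrative choices, and rw is a hypothetical helper):

```r
# Two random walk chains on a N(0,1) target started far apart with a small
# scale (values are illustrative): after T = 200 iterations they still disagree
rw <- function(n, x0, sigma) {
  x <- numeric(n)
  x[1] <- x0
  for (i in 2:n) {
    y <- rnorm(1, x[i - 1], sigma)
    x[i] <- if (log(runif(1)) <= -0.5 * (y^2 - x[i - 1]^2)) y else x[i - 1]
  }
  x
}
set.seed(9)
c1 <- rw(200, -10, 0.05)
c2 <- rw(200, 10, 0.05)
gap <- min(c2) - max(c1)   # positive when the two traces never overlap
# matplot(cbind(c1, c2), type = "l") would display both traces together
```

A positive gap means that the two chains have visited disjoint regions, which is the situation where the burn-in period must exceed the number of iterations run so far.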

Autocorrelation plots of particular components provide in addition good
indications of the chain's mixing behavior. If $\rho_k$ ($k \in \mathbb{N}^*$) denotes the
$k$th-order autocorrelation,

$$\rho_k = \mathrm{corr}\left(x^{(t)}, x^{(t+k)}\right),$$

these quantities can be estimated from the observed chain itself,4 at least for
small values of $k$, and an effective sample size $T^{\mathrm{ess}}$ can be deduced from
these estimates, denoted $\hat\rho_k$ for the empirical autocorrelation function. This
quantity represents the sample size of an equivalent iid sample when running
$T$ iterations. Conversely, the ratio $T/T^{\mathrm{ess}}$ indicates the multiplying factor on
the minimum number of iid iterations required to run a simulation. Note,
however, that this is only a partial indicator: Chains that remain stuck in one
of the modes of the target distribution may well have a high effective ratio.
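A hedged sketch of an effective sample size computation based on acf is given below. It assumes the common form $T/(1 + 2\sum_k \hat\rho_k)$ with a truncated sum, which may differ from the exact expression intended here; the names ess and max_lag are hypothetical:

```r
# Effective sample size sketch, assuming the common form
# T / (1 + 2 * sum_k rho_k); ess and max_lag are hypothetical names
ess <- function(x, max_lag = 50) {
  rho <- acf(x, lag.max = max_lag, plot = FALSE)$acf[-1]  # drop lag 0
  length(x) / (1 + 2 * sum(rho))
}
set.seed(2)
iid <- rnorm(5000)                                       # independent draws
ar1 <- as.numeric(arima.sim(list(ar = 0.9), n = 5000))   # strongly correlated
ess_iid <- ess(iid)
ess_ar1 <- ess(ar1)
```

For the independent sample the value stays close to the nominal length, while the autocorrelated series is worth far fewer independent draws.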

While we cannot discuss at length the selection of the proposal
distribution (see Robert and Casella, 2004, Chap. 7, and Robert and Casella, 2009,
Chap. 6), we stress that this is an important choice that has deep consequences
for the convergence properties of the simulated Markov chain and thus for the
exploration of the target distribution. As for prior distributions, we advise
the simultaneous use of different kernels to assess their performances on the
run. When considering a random walk proposal, for instance, a quantity that
needs to be calibrated against the target distribution is the scale of this
random walk. Indeed, if the variance of the proposal is too small with respect to
the target distribution, the exploration of the target support will be small and
may fail in more severe cases. Similarly, if the variance is too large, this means
that the proposal will most often generate values that are outside the support
of the target and that the algorithm will reject a large portion of attempted
transitions.

3 Guaranteed convergence as in accept-reject algorithms is sometimes achievable
with MCMC methods using techniques such as perfect sampling or renewal. But
such techniques require a much more advanced study of the target distribution and
the transition kernel of the algorithm. These conditions are not met very often in
practice (see Robert and Casella 2004, Chap. 13).

4 In R, this estimation can be conducted using the acf function.


It seems reasonable to tune the proposal distribution in terms of its past
performances, for instance by increasing the variance if the acceptance rate is high
or decreasing it otherwise (or moving the location parameter toward the mean
estimated over the past iterations). This must not be implemented outside a
burn-in step, though, because a permanent modification of the proposal
distribution amounts to taking into account the whole past of the sequence and
thus it cancels both its Markovian nature and its convergence guarantees.
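The following sketch illustrates tuning restricted to a burn-in step for a $\mathcal{N}(0,1)$ target. The batch size, the doubling and halving rule, and the 0.25 to 0.5 acceptance band are all assumptions made for the illustration, and hm_tuned is a hypothetical name:

```r
# Random walk MH on a N(0,1) target whose proposal scale is adapted during
# burn-in only (sketch; names and the 0.25-0.5 band are assumptions)
hm_tuned <- function(n, x0, sigma = 1, burnin = 2000, batch = 100) {
  x <- numeric(n)
  x[1] <- x0
  acc <- 0
  for (i in 2:n) {
    y <- rnorm(1, x[i - 1], sigma)
    if (log(runif(1)) <= -0.5 * (y^2 - x[i - 1]^2)) {
      x[i] <- y
      acc <- acc + 1
    } else x[i] <- x[i - 1]
    # adapt only inside the burn-in window, once per batch
    if (i <= burnin && i %% batch == 0) {
      rate <- acc / batch
      if (rate > 0.5) sigma <- 2 * sigma     # acceptance too high: widen
      if (rate < 0.25) sigma <- sigma / 2    # acceptance too low: shrink
      acc <- 0
    }
  }
  # the burn-in values, produced by a changing kernel, are discarded
  list(x = x[-(1:burnin)], sigma = sigma)
}
set.seed(3)
fit <- hm_tuned(12000, 0, sigma = 0.01)
```

After iteration burnin the scale is frozen, so the retained values come from a genuine (time-homogeneous) Markov chain.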


Consider, solely for illustration purposes, the standard normal distribution
$\mathcal{N}(0, 1)$ as a target. If we use Algorithm 4.6 with a normal random walk, i.e.,

$$\tilde{x} \mid x^{(t-1)} \sim \mathcal{N}\left(x^{(t-1)}, \sigma^2\right),$$

the performance of the sampler depends on the value $\sigma$. An R function that
implements the associated Hastings-Metropolis sampler is coded as

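hm=function(n,x0,sigma2){
  x=rep(x0,n)
  for (i in 2:n){
    # normal random walk proposal centered at the current value
    y=rnorm(1,x[i-1],sqrt(sigma2))
    # accept with probability min(1, pi(y)/pi(x[i-1])) for the N(0,1) target
    if (runif(1)<=exp(-0.5*(y^2-x[i-1]^2))) x[i]=y
    else x[i]=x[i-1]
  }
  x
}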
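To see how the scale affects this sampler, a short sketch can record the empirical acceptance rate and the lag-one autocorrelation for several values of sigma2. The function hm is repeated so that the snippet runs on its own; the helper acc_rate and the chosen scales are our assumptions:

```r
# hm as in the text, repeated so that the snippet runs on its own
hm <- function(n, x0, sigma2) {
  x <- rep(x0, n)
  for (i in 2:n) {
    y <- rnorm(1, x[i - 1], sqrt(sigma2))
    if (runif(1) <= exp(-0.5 * (y^2 - x[i - 1]^2))) x[i] <- y
    else x[i] <- x[i - 1]
  }
  x
}
# empirical acceptance rate: proportion of iterations where the chain moved
acc_rate <- function(x) mean(diff(x) != 0)
set.seed(4)
scales <- c(1e-4, 1, 1e3)
chains <- lapply(scales, function(s2) hm(10000, 0, s2))
rates <- sapply(chains, acc_rate)
lag1 <- sapply(chains, function(x) acf(x, plot = FALSE)$acf[2])
```

The vectors rates and lag1 can then be compared across the three scales, a very small scale giving an acceptance rate close to one and a very large scale a rate close to zero.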

For instance, picking $\sigma^2$ equal to either $10^{-4}$ or $10^{3}$ provides two extreme
cases: As shown in Fig. 4.2, the chain has a high acceptance rate but a low
exploration ability and a high autocorrelation in the former case, while its
acceptance rate is low but its ability to move around the normal range is
high in the latter case (with a quickly decreasing autocorrelation). Both
cases use the "wrong scale", though, in that the histograms of the
simulation outputs are quite far from the target distribution after 10,000 iterations,
and this indicates that a much larger number of iterations must be used.
A comparison with Fig. 4.3, which corresponds to $\sigma = 1$, clearly makes this
point but also illustrates the fact that the large variance still induces large
autocorrelations.

Fig. 4.2. Simulation of a $\mathcal{N}(0, 1)$ target with (left) a $\mathcal{N}(x, 10^{-4})$ and (right)
a $\mathcal{N}(x, 10^{3})$ random walk proposal. Top: Sequence of 10,000 iterations; middle:
histogram of the last 2,000 iterations compared with the target density; bottom:
empirical autocorrelations using R function plot.acf

Fig. 4.3. Same legend as Fig. 4.2 for a $\mathcal{N}(x, 1)$ random walk proposal

Several MCMC algorithms can be mixed together within a single algorithm
using either a circular or a random design. While this construction is often
suboptimal (in that the inefficient algorithms in the mixture are still used on
a regular basis), it almost always brings an improvement compared with its
individual components. A special case where a mixed scenario is used is the
Metropolis-within-Gibbs algorithm: When building a Gibbs sampler, it may
happen that it is difficult or impossible to simulate from one or several of
the conditional distributions. In that case, a single Metropolis step associated
with this conditional distribution (as its target) can be used instead.5


4.3  The  Probit  Model 

We  now  engage  in  a  full  discussion  of  the  Bayesian  processing  of  the  probit 
model  introduced  in  Sect.  4.1,  taking  special  care  to  distinguish  between  the 
various  types  of  prior  modeling. 


4.3.1  Flat  Prior 


If no prior information is available, we can resort (as usual!) to a default flat
prior on $\beta$, $\pi(\beta) \propto 1$, and then obtain the posterior distribution

$$\pi(\beta \mid y) \propto \prod_{i=1}^{n} \Phi(x^{i\mathsf{T}}\beta)^{y_i} \left[1 - \Phi(x^{i\mathsf{T}}\beta)\right]^{1-y_i},$$
which  is  nonstandard  and  must  be  simulated  using,  e.g.,  MCMC  techniques. 
First,  the  log-likelihood  function  is  computable,  as  shown  by  the  following  R 
code6: 

probitll=function(beta,y,X){
  # probit log-likelihood, evaluated at each row of beta
  if (is.matrix(beta)==F) beta=as.matrix(t(beta))
  n=dim(beta)[1]
  pll=rep(0,n)
  for (i in 1:n){
    lF1=pnorm(X%*%beta[i,],log.p=TRUE)    # log Phi(x'beta)
    lF2=pnorm(-X%*%beta[i,],log.p=TRUE)   # log (1-Phi(x'beta))
    pll[i]=sum(y*lF1+(1-y)*lF2)
  }
  pll
}

5 We stress that we do not resort to an MH algorithm for the purpose of
simulating exactly from the corresponding conditional since this would require an infinite
number of iterations but rather that we use a single iteration of the MH algorithm
as a substitute for the simulation from the conditional since the resulting MCMC
algorithm is still associated with the same stationary distribution.

6 The use of the is.matrix test ensures that the function can be computed at one
point as well as on multiple points and thus allows for calls from plot and other
graphical functions.

A variety of Metropolis-Hastings algorithms have been proposed for
obtaining samples from this posterior distribution. Here we consider a sampler that
appears to work well when the number of predictors is reasonably small. This
Metropolis-Hastings sampler is a random walk scheme that uses the
maximum likelihood estimate $\hat\beta$ as a starting value and the asymptotic (Fisher)
covariance matrix $\hat\Sigma$ of the maximum likelihood estimate as the covariance
matrix for the proposal density, $\tilde\beta \sim \mathcal{N}_k(\beta^{(t-1)}, \tau^2\hat\Sigma)$.


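Algorithm 4.7 Probit Metropolis-Hastings Sampler

Initialization: Compute the MLE $\hat\beta$ and the covariance matrix $\hat\Sigma$
corresponding to the asymptotic covariance of $\hat\beta$, and set $\beta^{(0)} = \hat\beta$.

Iteration $t \ge 1$:

1. Generate $\tilde\beta \sim \mathcal{N}_k(\beta^{(t-1)}, \tau^2\hat\Sigma)$.

2. Compute

   $$\rho(\beta^{(t-1)}, \tilde\beta) = \min\left(1, \pi(\tilde\beta \mid y)/\pi(\beta^{(t-1)} \mid y)\right).$$

3. With probability $\rho(\beta^{(t-1)}, \tilde\beta)$, take $\beta^{(t)} = \tilde\beta$;
   otherwise take $\beta^{(t)} = \beta^{(t-1)}$.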


The R function glm is obviously quite helpful in setting the initialization
step of Algorithm 4.7. The step used in the R code to scale the algorithm is
based on

> mod=summary(glm(y~X,family=binomial(link="probit")))

with mod$coeff[,1] corresponding to $\hat\beta$ and mod$cov.unscaled to $\hat\Sigma$. The
following code then reproduces the above algorithm in R:

hmflatprobit=function(niter,y,X,scale){
  p=dim(X)[2]
  # MLE and asymptotic covariance from a probit fit without intercept
  mod=summary(glm(y~-1+X,family=binomial(link="probit")))
  beta=matrix(0,niter,p)
  beta[1,]=as.vector(mod$coeff[,1])
  Sigma2=as.matrix(mod$cov.unscaled)
  for (i in 2:niter){
    tildebeta=rmnorm(1,beta[i-1,],scale*Sigma2)
    # log acceptance ratio under the flat prior
    llr=probitll(tildebeta,y,X)-probitll(beta[i-1,],y,X)
    if (runif(1)<=exp(llr)) beta[i,]=tildebeta
    else beta[i,]=beta[i-1,]
  }
  beta
}

7 A choice of parameters that depend on the data for the Metropolis-Hastings
proposal is completely valid, both from an MCMC point of view (meaning that
this is not a self-tuning algorithm) and from a Bayesian point of view (since the
parameters of the proposal are not those of the prior).

It  takes  advantage  of  the  multivariate  normal  generator  rmnorm,  part  of  the 
package  mnormt  that  caters  to  the  multivariate  normal  distribution. 
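For readers without mnormt at hand, Algorithm 4.7 can be sketched in base R on simulated data. The data, the sample size, and the Cholesky-based normal draw standing in for rmnorm are all assumptions of this illustration:

```r
# Sketch of Algorithm 4.7 on simulated data, base R only
# (data, sample size, and the chol-based normal draw are assumptions)
set.seed(5)
n <- 200
X <- cbind(rnorm(n), rnorm(n))
y <- rbinom(n, 1, pnorm(X %*% c(1, -0.5)))
# probit log-likelihood, i.e., log posterior under the flat prior
ll <- function(b) sum(y * pnorm(X %*% b, log.p = TRUE) +
                      (1 - y) * pnorm(-X %*% b, log.p = TRUE))
mod <- summary(glm(y ~ -1 + X, family = binomial(link = "probit")))
mle <- as.vector(mod$coefficients[, 1])
R <- chol(mod$cov.unscaled)          # t(R) %*% R equals the covariance
niter <- 5000
beta <- matrix(0, niter, 2)
beta[1, ] <- mle
for (i in 2:niter) {
  # random walk proposal with covariance Sigma-hat (tau^2 = 1)
  prop <- beta[i - 1, ] + as.vector(t(R) %*% rnorm(2))
  if (log(runif(1)) <= ll(prop) - ll(beta[i - 1, ])) beta[i, ] <- prop
  else beta[i, ] <- beta[i - 1, ]
}
post_mean <- colMeans(beta[-(1:500), ])   # discard 500 burn-in iterations
```

With a flat prior and a moderate sample size, post_mean is expected to stay close to the maximum likelihood estimate mle.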

For bank, using a probit modeling with no intercept over the four
measurements, we tested three different scales, namely $\tau = 1, 0.1, 10$, by running
Algorithm 4.7 over 10,000 iterations. Looking both at the raw sequences and
at the autocorrelation graphs, it appears that the best mixing behavior is
associated with $\tau = 1$. Figure 4.4 illustrates the output of the simulation run
in that case.8 Using a burn-in range of 1,000 iterations, the averages of the
parameters over the last 9,000 iterations are equal to $-1.2193$, $0.9540$, $0.9795$,
and $1.1481$, respectively. A plug-in estimate of the predictive probability of a
counterfeit banknote is therefore

$$\hat{p}_i = \Phi\left(-1.2193\,x_{i1} + 0.9540\,x_{i2} + 0.9795\,x_{i3} + 1.1481\,x_{i4}\right).$$

For instance, according to this equation, a banknote of length 214.9 mm,
left-edge width 130.1 mm, right-edge width 129.9 mm, and bottom margin width
9.5 mm is counterfeited with probability

$$\Phi\left(-1.2193 \times 214.9 + \ldots + 1.1481 \times 9.5\right) \approx 0.5917.$$
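The plug-in computation can be reproduced directly from the posterior averages quoted above; the variable names below are ours:

```r
# Plug-in predictive probability for the banknote quoted in the text
beta_hat <- c(-1.2193, 0.9540, 0.9795, 1.1481)  # posterior averages from the text
x_new <- c(214.9, 130.1, 129.9, 9.5)            # measurements in mm
p_hat <- pnorm(sum(beta_hat * x_new))           # Phi of the linear predictor
round(p_hat, 4)
```

Since the linear predictor is a small difference between large terms, the result is sensitive to the rounding of the coefficients.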


While the plug-in representation above gives an immediate evaluation of
the predictive probability, a better approximation to this probability function
is provided by the average over the iterations of the current predictive
probabilities, $\Phi\left(\beta_1^{(t)} x_{i1} + \beta_2^{(t)} x_{i2} + \beta_3^{(t)} x_{i3} + \beta_4^{(t)} x_{i4}\right)$. It is easily derived from the
output of the hmflatprobit function.
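The difference between the two approximations can be sketched as follows. The matrix draws is a synthetic stand-in for the output of an MCMC run (it is not the bank output), and x_new is an arbitrary covariate vector:

```r
# Averaging predictive probabilities over MCMC draws versus plugging in the
# posterior mean; `draws` is a synthetic stand-in for an MCMC output matrix
set.seed(6)
draws <- cbind(rnorm(9000, 0.5, 0.3), rnorm(9000, -1, 0.3))
x_new <- c(1, 2)
p_avg  <- mean(pnorm(draws %*% x_new))          # average of Phi(x' beta^(t))
p_plug <- pnorm(sum(colMeans(draws) * x_new))   # Phi(x' mean(beta))
```

Because $\Phi$ is nonlinear, the two values differ whenever the posterior on $\beta$ is not concentrated at a single point.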


8 We do not include the graphs for the other values of $\tau$, but the curious reader
can check that there is indeed a clear difference with the case $\tau = 1$.

Fig. 4.4. Dataset bank: Estimation of the probit coefficients via Algorithm 4.7 and
a flat prior. Left: $\beta_i$'s ($i = 1, \ldots, 4$); center: histogram over the last 9,000 iterations;
right: autocorrelation over the last 9,000 iterations


4.3.2 Noninformative G-Priors

Following the principles discussed in earlier chapters (see, e.g., Chap. 3), a flat
prior on $\beta$ is not appropriate for comparison purposes since we cannot validate
the corresponding Bayes factors. In a variable selection setup, we thus need
to replace the flat prior with, e.g., a hierarchical prior,

$$\beta \mid \sigma^2 \sim \mathcal{N}_k\left(0_k, \sigma^2 (X^{\mathsf{T}}X)^{-1}\right) \quad\text{and}\quad \pi(\sigma^2) \propto \sigma^{-2},$$

inspired by the normal linear regression model.9 Integrating out $\sigma^2$ in this
joint prior then leads to


$$\pi(\beta) \propto |X^{\mathsf{T}}X|^{1/2}\, \Gamma(k/2)\, \left(\beta^{\mathsf{T}}(X^{\mathsf{T}}X)\beta\right)^{-k/2} \pi^{-k/2},$$


which is clearly improper. Nonetheless, if we consider the same hierarchical
prior for a submodel associated with a subset of the predictor variables in $X$,
associated with the same variance factor $\sigma^2$, the marginal distribution of $y$
then depends on the same unknown multiplicative constant as the full model,
and this constant cancels in the corresponding Bayes factor. This is exactly
the same idea as for Zellner's noninformative G-prior, see Sect. 3.4.3.

The corresponding posterior distribution of $\beta$ is

$$\pi(\beta \mid y) \propto |X^{\mathsf{T}}X|^{1/2}\, \Gamma(k/2)\, \left(\beta^{\mathsf{T}}(X^{\mathsf{T}}X)\beta\right)^{-k/2} \pi^{-k/2}
\times \prod_{i=1}^{n} \Phi(x^{i\mathsf{T}}\beta)^{y_i} \left[1 - \Phi(x^{i\mathsf{T}}\beta)\right]^{1-y_i}.$$

Note that we need to keep the "constant" terms $|X^{\mathsf{T}}X|^{1/2}$, $\Gamma(k/2)$, and $\pi^{-k/2}$
in this expression because they vary among submodels. To omit these terms
would thus result in a bias in the computation of the Bayes factors.

9 Note that the matrix $X^{\mathsf{T}}X$ is not the Fisher information matrix outside of the
normal model. However, the (genuine) Fisher information matrix usually involves a
function of $\beta$ that prevents its use as a prior (inverse) covariance matrix on $\beta$.

Contrary to the linear regression setting and as for the flat prior in
Sect. 4.3.1, neither the posterior distribution of $\beta$ nor the marginal
distribution of $y$ can be derived analytically. We can however use exactly the same
Metropolis-Hastings sampler as in Sect. 4.3.1, namely a random walk proposal
based on the estimated Fisher information matrix for its scale and the MLE
$\hat\beta$ as its starting value.


For bank, the corresponding approximate Bayes estimate of $\beta$ is given by

$$\mathbb{E}^{\pi}[\beta \mid y] \approx (-1.1552, 0.9200, 0.9121, 1.0820),$$

which slightly differs from the estimate found in Sect. 4.3.1 for the flat prior.
This approximation was obtained by running the MH algorithm with scale
$\tau^2 = 1$ over 10,000 iterations and averaging over the last 9,000 iterations.
Figure 4.5 gives an assessment of the convergence of the MH scheme that
does not vary very much compared with the previous figure.


We now address the specific problem of approximating the marginal
distribution of $y$ toward providing approximations to the Bayes factor and thus
achieving the Bayesian equivalent of standard software to identify significant
variables in the probit model. The marginal distribution of $y$ is

$$f(y) \propto |X^{\mathsf{T}}X|^{1/2}\, \pi^{-k/2}\, \Gamma(k/2) \int \left(\beta^{\mathsf{T}}(X^{\mathsf{T}}X)\beta\right)^{-k/2}
\prod_{i=1}^{n} \Phi(x^{i\mathsf{T}}\beta)^{y_i} \left[1 - \Phi(x^{i\mathsf{T}}\beta)\right]^{1-y_i} \mathrm{d}\beta\,,$$


which cannot be computed in closed form. We thus propose to use as a generic
proxy an importance sampling approximation to this integral based on a
normal approximation $\mathcal{N}_k(\hat\beta, 2\hat{V})$ to $\pi(\beta \mid y)$, where $\hat\beta$ is the MCMC
approximation of $\mathbb{E}^{\pi}[\beta \mid y]$ and $\hat{V}$ is the MCMC approximation10 of $\mathbb{V}(\beta \mid y)$. The
corresponding estimate of the marginal distribution of $y$ is then, up to a constant,

$$\hat{f}(y) \propto \frac{\Gamma(k/2)\,|X^{\mathsf{T}}X|^{1/2}}{\pi^{k/2} M} \sum_{m=1}^{M}
\left(\beta^{(m)\mathsf{T}}(X^{\mathsf{T}}X)\beta^{(m)}\right)^{-k/2}
\prod_{i=1}^{n} \Phi(x^{i\mathsf{T}}\beta^{(m)})^{y_i} \left[1 - \Phi(x^{i\mathsf{T}}\beta^{(m)})\right]^{1-y_i}$$
$$\times\; (4\pi)^{k/2}\, |\hat{V}|^{1/2}\, e^{(\beta^{(m)} - \hat\beta)^{\mathsf{T}} \hat{V}^{-1} (\beta^{(m)} - \hat\beta)/4}\,, \qquad (4.5)$$

where the $\beta^{(m)}$'s are simulated from the $\mathcal{N}_k(\hat\beta, 2\hat{V})$ importance distribution.

10 The factor 2 in the covariance matrix allows some amount of overdispersion,
which is always welcomed in importance sampling settings, if only for variance
finiteness purposes.

Fig. 4.5. Dataset bank: Same legend as Fig. 4.4 using an MH algorithm and a
G-prior on $\beta$

If we consider a linear restriction on $\beta$ such as $H_0 : R\beta = r$, with $r \in \mathbb{R}^q$
and $R$ a $q \times k$ matrix of rank $q$, the submodel is associated with the likelihood

$$\ell(\beta^0 \mid y) \propto \prod_{i=1}^{n} \Phi(x_0^{i\mathsf{T}}\beta^0)^{y_i} \left[1 - \Phi(x_0^{i\mathsf{T}}\beta^0)\right]^{1-y_i},$$

where $\beta^0$ is $(k - q)$-dimensional and $X_0$ and $x_0$ are linear transforms of $X$
and $x$ of dimensions $(n, k - q)$ and $(k - q)$, respectively. Under the G-prior

$$\beta^0 \mid \sigma^2 \sim \mathcal{N}_{k-q}\left(0_{k-q}, \sigma^2 (X_0^{\mathsf{T}}X_0)^{-1}\right) \quad\text{and}\quad \pi(\sigma^2) \propto \sigma^{-2},$$

the marginal distribution of $y$ is of the same type as in the unconstrained
case, namely,

$$f(y) \propto |X_0^{\mathsf{T}}X_0|^{1/2}\, \pi^{-(k-q)/2}\, \Gamma\{(k-q)/2\} \int \left\{(\beta^0)^{\mathsf{T}}(X_0^{\mathsf{T}}X_0)\beta^0\right\}^{-(k-q)/2}
\prod_{i=1}^{n} \Phi(x_0^{i\mathsf{T}}\beta^0)^{y_i} \left[1 - \Phi(x_0^{i\mathsf{T}}\beta^0)\right]^{1-y_i} \mathrm{d}\beta^0\,.$$

Once again, if we first run an MCMC sampler for the posterior of $\beta^0$ for this
submodel, it provides both parameters of a normal importance distribution
and thus allows an approximation of the marginal distribution of $y$ in the
submodel in all ways similar to (4.5).

For bank, if we want to test the null hypothesis $H_0 : \beta_1 = \beta_2 = 0$, we
obtain the Bayes factor $B^{\pi}_{10} = 8916.0$ via the importance sampling
approximation of (4.5). We use the following R commands, which again borrow functions
like dmnorm and rmnorm from the package mnormt,

# full model
mkprob=apply(noinfprobit,2,mean)
vkprob=var(noinfprobit)
simk=rmnorm(100000,mkprob,2*vkprob)
usk=probitnoinflpost(simk,y,X[,2:5])-
  dmnorm(simk,mkprob,2*vkprob,log=T)

# null model
noinfprobit0=hmnoinfprobit(10000,y,X[,4:5],1)
mk0=apply(noinfprobit0,2,mean)
vk0=var(noinfprobit0)
simk0=rmnorm(100000,mk0,2*vk0)
usk0=probitnoinflpost(simk0,y,X[,4:5])-
  dmnorm(simk0,mk0,2*vk0,log=T)

# Bayes factor
bf0probit=mean(exp(usk))/mean(exp(usk0))

Using Jeffreys' scale of evidence, since $\log_{10}(B^{\pi}_{10}) = 3.950$, the posterior
distribution is strongly against $H_0$.
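When the log-weights are very negative, exponentiating them directly may underflow. A possible safeguard, sketched here with a hypothetical helper log_mean_exp and synthetic inputs (not the bank values), is to work on the log scale:

```r
# Stable evaluation of log10 of mean(exp(usk)) / mean(exp(usk0));
# log_mean_exp is a hypothetical helper, the inputs below are synthetic
log_mean_exp <- function(u) {
  m <- max(u)
  m + log(mean(exp(u - m)))   # log of the average of exp(u), shifted by max
}
log10_bf <- function(usk, usk0)
  (log_mean_exp(usk) - log_mean_exp(usk0)) / log(10)
u1 <- c(-1000, -1001, -1002)  # log-weights far below the underflow threshold
u0 <- u1 - log(10)            # same weights divided by 10
val <- log10_bf(u1, u0)
```

In this toy case mean(exp(u1)) evaluates to zero in double precision, while the log-scale version still returns the ratio of the two averages.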


More generally, we can produce a Bayesian regression output, programmed
in R, that mimics the standard software output for generalized linear models.
Along with the estimates of the $\beta_i$'s, given by their posterior expectation, we
include the posterior variances of the $\beta_i$'s, also derived from the MCMC
sample, and the log Bayes factors $\log_{10}(B^{\pi}_{i0})$ corresponding to the null hypotheses
$H_0 : \beta_i = 0$. As above, the Bayes factors are computed by importance
sampling based on 100,000 simulations. The stars are related to Jeffreys' scale of
evidence.

For bank, the corresponding outcome is

     Estimate  Post. var.  log10(BF)
X1    -1.1552      0.0631     4.5844 (****)
X2     0.9200      0.3299    -0.2875
X3     0.9121      0.2595    -0.0972
X4     1.0820      0.0287    15.6765 (****)

evidence against H0: (****) decisive, (***) strong,
(**) substantial, (*) poor

Although these Bayes factors cannot be used simultaneously, an informal
conclusion is that the significant variables for the identification of counterfeited
banknotes are $X_1$ and $X_4$.


4.3.3  About  Informative  Prior  Analyses 

In the setting of probit (and other generalized linear) models, it is
unrealistic to expect practitioners to come up with precise prior information about
the parameters $\beta$. There exists nonetheless an amenable approach to prior
information through what is called the conditional mean family of prior
distributions. The intuition behind this approach is that prior beliefs about the
probabilities $p_i$ can be assessed to some extent by the practitioners for
particular values of the explanatory variables $x_{1i}, \ldots, x_{ki}$. Once this information
is taken into account, a corresponding prior can be derived for the parameter
vector $\beta$. This technique is certainly one of the easiest methods of
incorporating subjective prior information into the processing of the binary regression
problem, especially because it appeals to practitioners for whom the $\beta$'s have,
at best, a virtual meaning.

Starting with $k$ explanatory variables, we derive the subjective prior
information from $k$ different values11 of the covariate vector, denoted by $\tilde{x}^1, \ldots, \tilde{x}^k$.
For each of these values, the practitioner is asked to specify two things:

1. a prior guess $g_i$ at the probability of success $p_i$ associated with $\tilde{x}^i$; and

2. an assessment of her or his certainty about that guess translated as a
number $K_i$ of equivalent "prior observations."12 This question can be expressed
as "On how many imaginary observations did you build this guess?"

Both quantities can be turned into a formal prior density on $\beta$ by imposing
a beta prior distribution on $p_i$ with parameters $K_i g_i$ and $K_i(1 - g_i)$ since
the mean of a $\mathcal{B}e(a, b)$ distribution is $a/(a + b)$. If we make the additional
assumption that the $k$ probabilities $p_1, \ldots, p_k$ are a priori independent (which
clearly does not hold since they all depend on the same $\beta$!), their joint
density is

$$\pi(p_1, \ldots, p_k) \propto \prod_{i=1}^{k} p_i^{K_i g_i - 1} (1 - p_i)^{K_i(1 - g_i) - 1}\,. \qquad (4.6)$$

11 The theoretical motivation for setting the number of covariate vectors equal to
the dimension of $\beta$ will be made clear below.

12 This technique is called the device of imaginary observations and was proposed
by the Italian statistician Bruno de Finetti for prior elicitation.
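The translation of guesses and prior sample sizes into beta parameters can be sketched in a few lines; the numerical values of g and K below are hypothetical:

```r
# Conditional mean prior sketch: guesses g and prior sample sizes K (both
# sets of values hypothetical) mapped to Be(K*g, K*(1-g)) parameters
g <- c(0.2, 0.5, 0.8)
K <- c(10, 4, 20)
a <- K * g
b <- K * (1 - g)
prior_mean <- a / (a + b)                             # recovers the guesses g
prior_sd <- sqrt(a * b / ((a + b)^2 * (a + b + 1)))   # shrinks as K grows
```

A larger number of imaginary observations thus leaves the prior mean unchanged and only tightens the prior around the guess.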

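Now, if we relate the probabilities $p_i$ to the parameter $\beta$, conditional on the
covariate vectors $\tilde{x}^1, \ldots, \tilde{x}^k$, by $p_i = \Phi(\tilde{x}^{i\mathsf{T}}\beta)$, we conclude that the
corresponding distribution on $\beta$ is

$$\pi(\beta) \propto \prod_{i=1}^{k} \Phi(\tilde{x}^{i\mathsf{T}}\beta)^{K_i g_i - 1} \left[1 - \Phi(\tilde{x}^{i\mathsf{T}}\beta)\right]^{K_i(1 - g_i) - 1} \varphi(\tilde{x}^{i\mathsf{T}}\beta)\,.$$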

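This change of variable explains why we needed exactly $k$ different covariate
vectors in the prior assessment.

This intuitive approach to prior modeling is also interesting from a
computational point of view since the corresponding posterior distribution

$$\pi(\beta \mid y) \propto \prod_{i=1}^{n} \Phi(x^{i\mathsf{T}}\beta)^{y_i} \left[1 - \Phi(x^{i\mathsf{T}}\beta)\right]^{1-y_i}
\times \prod_{j=1}^{k} \Phi(\tilde{x}^{j\mathsf{T}}\beta)^{K_j g_j - 1} \left[1 - \Phi(\tilde{x}^{j\mathsf{T}}\beta)\right]^{K_j(1 - g_j) - 1} \varphi(\tilde{x}^{j\mathsf{T}}\beta)$$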

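is of almost exactly the same type as the posterior distributions in both
noninformative modelings above. The main difference stands in the product of
the Jacobian terms $\varphi(\tilde{x}^{j\mathsf{T}}\beta)$ ($1 \le j \le k$), but

$$\prod_{j=1}^{k} \varphi(\tilde{x}^{j\mathsf{T}}\beta) \propto \exp\left\{-\sum_{j=1}^{k} (\tilde{x}^{j\mathsf{T}}\beta)^2/2\right\}
= \exp\left\{-\beta^{\mathsf{T}} \left[\sum_{j=1}^{k} \tilde{x}^{j}\tilde{x}^{j\mathsf{T}}\right] \beta \Big/ 2\right\}$$

means that, if we forget about the $-1$'s in the exponents, this posterior
distribution corresponds to a regular posterior distribution for the probit
model when adding to the observations $(y_1, x^1), \ldots, (y_n, x^n)$ the
pseudo-observations13 $(g_1, \tilde{x}^1), \ldots, (g_1, \tilde{x}^1), \ldots, (g_k, \tilde{x}^k), \ldots, (g_k, \tilde{x}^k)$, where each pair
$(g_i, \tilde{x}^i)$ is repeated $K_i$ times and when using the G-prior

$$\beta \sim \mathcal{N}_k\left(0_k, \left[\sum_{j=1}^{k} \tilde{x}^{j}\tilde{x}^{j\mathsf{T}}\right]^{-1}\right).$$

Therefore, Algorithm 4.7 need not be adapted to this case.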
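One way to read this representation is sketched below: the prior points enter a probit log-likelihood as weighted pseudo-observations with fractional responses, and the Jacobian terms act as a normal prior. All numerical values, and the names loglik, lpost, Xt, g, and K, are hypothetical:

```r
# Pseudo-observation sketch for the conditional mean prior: each prior point
# enters the probit log-likelihood with response g_j and weight K_j
# (all numerical values below are hypothetical)
set.seed(7)
n <- 50
X <- cbind(1, rnorm(n))
y <- rbinom(n, 1, pnorm(X %*% c(0.3, 1)))
Xt <- rbind(c(1, -1), c(1, 1))   # k = 2 prior covariate vectors
g <- c(0.3, 0.8)                 # prior guesses at the probabilities
K <- c(5, 5)                     # numbers of imaginary observations
loglik <- function(b, y, X, w = rep(1, length(y)))
  sum(w * (y * pnorm(X %*% b, log.p = TRUE) +
           (1 - y) * pnorm(-X %*% b, log.p = TRUE)))
# data term + weighted pseudo-data term + log of the product of Jacobian
# terms (the "-1" in the exponents is ignored, as in the text)
lpost <- function(b)
  loglik(b, y, X) + loglik(b, g, Xt, K) - 0.5 * sum((Xt %*% b)^2)
```

The function lpost can then replace the flat-prior log-posterior in the acceptance ratio of a random walk sampler.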

13 Note that the fact that the $g_j$'s do not take their values in $\{0, 1\}$ but rather in
$(0, 1)$ does not create any difficulty in the implementation of Algorithm 4.7.

4.4 The Logit Model

We now reproduce some of the developments of the previous section in the case
of the logit model, as defined in Sect. 4.1.2, not because there exist notable
differences with either the processing or the conclusions of the probit model
but rather because there is hardly any difference! For instance, Algorithm 4.7
can also be used for this model, while based on the same proposal, by simply
modifying the definition of $\pi(\beta \mid y)$, since the likelihood is now

$$\exp\left\{\sum_{i=1}^{n} y_i\, x^{i\mathsf{T}}\beta\right\} \Big/ \prod_{i=1}^{n} \left[1 + \exp(x^{i\mathsf{T}}\beta)\right]. \qquad (4.7)$$

The  R  function  that  computes  the  log-likelihood  of  the  logit  model  is 

logitll=function(beta,y,X){
  # logit log-likelihood, evaluated at each row of beta
  if (is.matrix(beta)==F) beta=as.matrix(t(beta))
  n=dim(beta)[1]
  pll=rep(0,n)
  for (i in 1:n){
    lF1=plogis(X%*%beta[i,],log.p=TRUE)
    lF2=plogis(-X%*%beta[i,],log.p=TRUE)
    pll[i]=sum(y*lF1+(1-y)*lF2)
  }
  pll
}

That both models can be processed in a very similar manner means, for
instance, that they can be easily compared when one is uncertain about which
link function to adopt. The Bayes factor used in the comparison of the probit
and logit models is directly derived from the importance sampling
experiments described for the probit model. Note also that, while the values of
the parameter $\beta$ differ between the two models, a subjective prior modeling
as in Sect. 4.3.3 can be conducted simultaneously for both models, the only
difference occurring for the change of variables from $(p_1, \ldots, p_k)$ to $\beta$.

If we use a flat prior on $\beta$, the posterior distribution proportional to (4.7)
can be inserted directly in Algorithm 4.7 to produce a sample approximately
distributed from this posterior (assuming it exists, which means observing a
sufficiently large and diverse sample). The corresponding R code is

hmflatlogit=function(niter,y,X,scale){
  p=dim(X)[2]
  # MLE and asymptotic covariance matrix, used to start and scale the random walk
  mod=summary(glm(y~-1+X,family=binomial(link="logit")))
  beta=matrix(0,niter,p)
  beta[1,]=as.vector(mod$coeff[,1])
  Sigma2=as.matrix(mod$cov.unscaled)
  for (i in 2:niter){
    tildebeta=rmvn(1,beta[i-1,],scale*Sigma2)
    # log acceptance ratio under the flat prior
    llr=logitll(tildebeta,y,X)-logitll(beta[i-1,],y,X)
    if (runif(1)<=exp(llr)) beta[i,]=tildebeta
    else beta[i,]=beta[i-1,]
  }
  beta
}

Fig. 4.6. Dataset bank: Estimation of the logit coefficients via Algorithm 4.7 under
a flat prior. Left: $\beta_i$'s ($i = 1,\ldots,4$); center: histogram over the last 9,000 iterations;
right: autocorrelation over the last 9,000 iterations
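The sampler above relies on a multivariate normal generator (rmvn) that is not defined in this excerpt. The following self-contained sketch of the same random walk scheme replaces it with a base R draw built from the Cholesky factor of the proposal covariance; the function names ll and rw_logit, the simulated data, and the seed are hypothetical choices made for illustration only.

```r
# log-likelihood of the logit model for a single parameter vector
ll=function(beta,y,X){
  eta=X%*%beta
  sum(y*plogis(eta,log.p=TRUE)+(1-y)*plogis(-eta,log.p=TRUE))
}

# random walk Metropolis-Hastings under a flat prior (hypothetical name)
rw_logit=function(niter,y,X,scale){
  mod=summary(glm(y~-1+X,family=binomial(link="logit")))
  p=ncol(X)
  beta=matrix(0,niter,p)
  beta[1,]=mod$coefficients[,1]          # start at the MLE
  R=chol(scale*mod$cov.unscaled)         # t(R)%*%R equals scale*Sigma
  for (i in 2:niter){
    # normal proposal centered at the current value
    prop=beta[i-1,]+as.vector(t(R)%*%rnorm(p))
    # accept or reject on the log scale to avoid overflow
    if (log(runif(1))<=ll(prop,y,X)-ll(beta[i-1,],y,X)) beta[i,]=prop
    else beta[i,]=beta[i-1,]
  }
  beta
}

# simulated data (assumption: two covariates, no intercept)
set.seed(42)
n=300
X=cbind(rnorm(n),rnorm(n))
y=rbinom(n,1,plogis(X%*%c(1,-1)))

out=rw_logit(2000,y,X,1)
post_mean=colMeans(out[-(1:500),])       # discard the first 500 iterations
acc_rate=mean(diff(out[,1])!=0)          # empirical acceptance rate
```

Comparing the log of a uniform draw with the log acceptance ratio is equivalent to the test runif(1)<=exp(llr) used in the text, and monitoring acc_rate is one simple way to calibrate the scale factor.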


For bank, Fig. 4.6 summarizes the results of running Algorithm 4.7 with
the scale factor equal to $\tau^2 = 1$: There is no clear difference between these
graphs and those of earlier figures, except for a slight increase in the skewness
of the histograms of the $\beta_i$'s. (Obviously, this does not necessarily
reflect a different convergence behavior but possibly a different posterior
behavior since we are not dealing with the same posterior distribution.) The MH
approximation, based on the last 9,000 iterations, of the Bayes estimate of
$\beta$ is equal to $(-2.5888, 1.9967, 2.1260, 2.1879)$. We can note the numerical
difference between these values and those produced by the probit model. The
sign and the relative magnitudes of the components are, however, very similar.
For comparison purposes, consider the plug-in estimate of the predictive
probability of a counterfeit banknote,

$$\frac{\exp(-2.5888\,x_1 + 1.9967\,x_2 + 2.1260\,x_3 + 2.1879\,x_4)}{1+\exp(-2.5888\,x_1 + 1.9967\,x_2 + 2.1260\,x_3 + 2.1879\,x_4)}.$$

Using this approximation, a banknote of length 214.9 mm, of left-edge width
130.1 mm, of right-edge width 129.9 mm, and of bottom margin width 9.5 mm
is counterfeited with probability

$$\frac{\exp(-2.5888\times 214.9 + \ldots + 2.1879\times 9.5)}{1+\exp(-2.5888\times 214.9 + \ldots + 2.1879\times 9.5)} = 0.5963.$$


This estimate of the probability is therefore very close to the estimate derived
from the probit modeling, which was equal to 0.5917 (especially if we
take into account the uncertainties associated both with the MCMC experiments
and with the plug-in shortcut).
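The plug-in computation can be reproduced in a few lines, pairing each coefficient with its covariate in the order length, left-edge width, right-edge width, and bottom margin width. The variable names betahat, xnew, eta, and p_logit are illustrative.

```r
# MH approximation of the Bayes estimate reported in the text
betahat=c(-2.5888,1.9967,2.1260,2.1879)
# length, left-edge width, right-edge width, bottom margin width (in mm)
xnew=c(214.9,130.1,129.9,9.5)
eta=sum(betahat*xnew)      # linear predictor
p_logit=plogis(eta)        # exp(eta)/(1+exp(eta)), about 0.596
```

Because the linear predictor is a difference of terms of order several hundred, the result is quite sensitive to the rounding of the coefficients, which is one more reason to read the plug-in value as an approximation.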


For model comparison purposes and the computation of Bayes factors,
we can also use the same G-prior as for the probit model and thus multiply
(4.7) by $|X^{\mathrm{T}}X|^{1/2}\,\Gamma(k/2)\,\left(\beta^{\mathrm{T}}(X^{\mathrm{T}}X)\beta\right)^{-k/2}\pi^{-k/2}$. The MH implementation
obviously remains the same.
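A minimal sketch of the corresponding log-posterior, obtained by adding the logarithm of the G-prior factor to the logarithm of (4.7), is given below. The function name logit_gprior_lpost is hypothetical; the constants are kept because they matter when posteriors of different dimensions are compared through Bayes factors.

```r
# log of (4.7) plus the log of the G-prior factor (hypothetical function name)
logit_gprior_lpost=function(beta,y,X){
  k=ncol(X)
  XtX=crossprod(X)                        # X^T X
  eta=as.vector(X%*%beta)
  loglik=sum(y*eta-log1p(exp(eta)))       # log of (4.7)
  logdet=as.numeric(determinant(XtX,logarithm=TRUE)$modulus)
  quad=as.numeric(t(beta)%*%XtX%*%beta)   # beta^T (X^T X) beta
  logprior=0.5*logdet+lgamma(k/2)-(k/2)*log(quad)-(k/2)*log(pi)
  loglik+logprior
}

# tiny worked example (assumption: a toy design with X^T X = 2 I)
X=rbind(diag(2),diag(2))
y=c(1,0,1,0)
lp=logit_gprior_lpost(c(1,0),y,X)
```

Replacing the log-likelihood difference by a difference of such log-posterior values is the only change needed in the acceptance step of the random walk sampler.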


For bank, Fig. 4.7 once more summarizes the output of the MH scheme
over 10,000 iterations. (Since we observe the same skewness in the histograms
as in Fig. 4.6, this feature is most certainly due to the corresponding posterior
distribution rather than to a deficiency in the convergence of the algorithm.)

We can repeat the test of the null hypothesis $H_0: \beta_1 = \beta_2 = 0$ already
done for the probit model and then obtain an approximate Bayes factor of
$B^{\pi}_{10} = 16972.3$, with the same conclusion as earlier (although with twice as
large an absolute value). We can also take advantage of the output software
programmed for the probit model to produce the following summary:


      Estimate   Post. var.   log10(BF)
X1    -2.3970    0.3286        4.8084 (****)
X2     1.6978    1.2220       -0.2453
X3     2.1197    1.0094       -0.1529
X4     2.0230    0.1132       15.9530 (****)

evidence against H0: (****) decisive, (***) strong,
(**) substantial, (*) poor

Therefore, the most important covariates are again $X_1$ and $X_4$.

Fig. 4.7. Dataset bank: Same legend as Fig. 4.6 using an MH algorithm and a
G-prior on $\beta$
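The star coding in the summary above can be mimicked with a small helper. The sketch below assumes that the legend follows Jeffreys' scale on the log10 Bayes factor with cut points 0, 0.5, 1, and 2; these thresholds and the name jeffreys_stars are assumptions and are not taken from the output software itself.

```r
# assumed cut points on log10(BF): below 0 none, 0 to 0.5 poor, 0.5 to 1 substantial,
# 1 to 2 strong, above 2 decisive
stars=c("","(*)","(**)","(***)","(****)")
jeffreys_stars=function(l10bf) stars[findInterval(l10bf,c(0,0.5,1,2))+1]

# log10 Bayes factors reported for the four covariates
l10bf=c(X1=4.8084,X2=-0.2453,X3=-0.1529,X4=15.9530)
evidence=jeffreys_stars(l10bf)

# the Bayes factor 16972.3 quoted above, on the same log10 scale
l10_joint=log10(16972.3)
```

With these cut points the helper returns four stars for X1 and X4 and no star for X2 and X3, in agreement with the summary.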


4.5  Log-Linear  Models 

We  conclude  this  chapter  with  an  application  of  generalized  linear  modeling 
to  the  case  of  factors,  already  mentioned  in  Sect.  3.1.  A  standard  approach 
to  the  analysis  of  associations  (or  dependencies)  between  categorical  variables 
(that  is,  variables  that  take  a  finite  number  of  values)  is  to  use  log-linear 
models.  These  models  are  special  cases  of  generalized  linear  models  connected 
to  the  Poisson  distribution,  and  their  name  stems  from  the  fact  that  they  have 
traditionally  been  based  on  the  logarithmic  link  function. 

4.5.1  Contingency  Tables 

In such models, a sufficient statistic is the contingency table, which is a
multiple-entry table made up of the cross-classified counts for the different
categorical variables. There is much literature on contingency tables, including
for instance Whittaker (1990) and Agresti (1996), because the corresponding
models are quite handy both in the social sciences and in survey processing,
where the observables are always reduced to a finite number of values.

The airquality dataset was obtained from the New York State
Department of Conservation (ozone data) and from the American National
Weather Service (meteorological data) and is part of the datasets contained
in R (Chambers et al., 1983) and available as

> air=data(airquality)

This dataset involves two repeated measurements over 111 consecutive days
of 1973, namely the mean ozone $u$ (in parts per billion) from 1 pm to 3 pm
at Roosevelt Island, the maximum daily temperature $v$ (in degrees F) at La
Guardia Airport, and, in addition, the month $w$ (coded from 5 for May to 9
for September). If we discretize the measurements $u$ and $v$ into dichotomous
variables (using the empirical median as the cutting point), we obtain the
following three-way contingency table of counts per combination of the three
(discretized) factors:

                     month
ozone     temp         5    6    7    8    9
[1,31]    [57,79]     17    4    2    5   18
          (79,97]      0    2    3    3    2
(31,168]  [57,79]      6    1    0    3    1
          (79,97]      1    2   21   12    8

This contingency table thus has $5 \times 2 \times 2 = 20$ entries deduced from the
number of categories of the three factors, among which some are zero because
the corresponding combination of the three factors has not been observed in
the study.
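One way to build such a table in R is sketched below. It assumes that the 111 days are the rows of airquality without any missing value and that the cutting points are the empirical medians computed on those rows; the object names air, ozone, temp, month, and counts3 are illustrative.

```r
# keep the days with no missing measurement (assumption: these are the 111 days)
air=na.omit(airquality)

# dichotomize ozone and temperature at their empirical medians
ozone=cut(air$Ozone,quantile(air$Ozone,c(0,0.5,1)),include.lowest=TRUE)
temp=cut(air$Temp,quantile(air$Temp,c(0,0.5,1)),include.lowest=TRUE)
month=factor(air$Month)

# three-way table of counts, displayed with months as columns
counts3=table(ozone,temp,month)
ftable(counts3,row.vars=c("ozone","temp"))
```

The flattened display produced by ftable has the same layout as the table above, with one row per (ozone, temp) combination and one column per month.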


Each term in the table being an integer, it can then in principle be modeled
as a Poisson variable. If we denote the counts by $y = (y_1,\ldots,y_n)$, where
$i = 1,\ldots,n$ is an arbitrary way of indexing the cells of the table, we can thus
assume that $y_i \sim \mathscr{P}(\mu_i)$. Obviously, the likelihood

$$\ell(\mu|y) = \prod_{i=1}^{n} \frac{1}{y_i!}\,\mu_i^{y_i}\exp(-\mu_i),$$

where $\mu = (\mu_1,\ldots,\mu_n)$, shows that the model is saturated, namely that no
structure can be exhibited because there are as many parameters as there
are entries in the table. To exhibit any structure, we need to constrain the
$\mu_i$'s and do so via a GLM whose covariate matrix $X$ is directly derived from
the contingency table itself. If some entries are structurally equal to zero (as
for instance when crossing "number of pregnancies" with "male indicators"),
these entries should be removed from the model.

An R function that corresponds to this log-linear model log-likelihood is

loglinll=function(beta,y,X){
  # accept a single parameter vector as well as a matrix with one beta per row
  if (is.matrix(beta)==FALSE) beta=as.matrix(t(beta))
  n=dim(beta)[1]
  pll=rep(0,n)
  for (i in 1:n){
    # Poisson means under the logarithmic link
    lF=exp(X%*%beta[i,])
    pll[i]=sum(dpois(y,lF,log=T))
  }
  pll
}
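A short check of this function on simulated counts is sketched below: the value at the maximum likelihood estimate should coincide with the log-likelihood returned by glm for the Poisson family. The design matrix, the seed, and the names ll_mle and ll_glm are illustrative assumptions.

```r
# log-likelihood of the log-linear model, as defined in the text
loglinll=function(beta,y,X){
  if (is.matrix(beta)==FALSE) beta=as.matrix(t(beta))
  n=dim(beta)[1]
  pll=rep(0,n)
  for (i in 1:n){
    lF=exp(X%*%beta[i,])
    pll[i]=sum(dpois(y,lF,log=T))
  }
  pll
}

# simulated counts (assumption: an intercept and a single 0/1 indicator)
set.seed(2)
X=cbind(1,rep(0:1,each=5))
y=rpois(10,exp(X%*%c(1,0.5)))

fit=glm(y~-1+X,family=poisson())
ll_mle=loglinll(coef(fit),y,X)
ll_glm=as.numeric(logLik(fit))
```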

with again the use of is.matrix and as.matrix to allow for matricial calls
to the loglinll function.

When we constrain the mean parameters $\mu_i$ of a log-linear model to satisfy

$$\log(\mu_i) = x^{i\mathrm{T}}\beta,$$

the covariate vector $x^i$ is rather peculiar in that it is made up only of
indicators. The so-called incidence matrix $X$ with rows equal to the $x^i$'s is
thus such that its elements are all zeros or ones. Given a contingency table,
the choice of indicator variables to include in $x^i$ can vary, depending on what
is deemed (or found) to be an important relation between some categorical
variables. For instance, suppose that there are three categorical variables, $u$,
$v$, and $w$ as in airquality, and that $u$ takes $I$ values, $v$ takes $J$ values, and $w$
takes $K$ values. If we only include the indicators for the values of the three
categorical variables in $X$, we have

$$\log(\mu_\tau) = \sum_{b=1}^{I} \beta^{u}_{b}\,\mathbb{I}_b(u_\tau) + \sum_{b=1}^{J} \beta^{v}_{b}\,\mathbb{I}_b(v_\tau) + \sum_{b=1}^{K} \beta^{w}_{b}\,\mathbb{I}_b(w_\tau);$$

that is,

$$\log(\mu_{l(i,j,k)}) = \beta^{u}_{i} + \beta^{v}_{j} + \beta^{w}_{k}$$

($1 \le i \le I$, $1 \le j \le J$, $1 \le k \le K$), where $l(i,j,k)$ corresponds to the index
of the $(i,j,k)$ entry in the table, namely the case when $u = i$, $v = j$, and
$w = k$. Similarly, the saturated log-linear model corresponds to the use of one
indicator per entry of the table; that is ($1 \le i \le I$, $1 \le j \le J$, $1 \le k \le K$),

$$\log(\mu_{l(i,j,k)}) = \beta^{uvw}_{ijk}.$$

For comparative reasons that will very soon become apparent, and by
analogy with analysis of variance (ANOVA) conventions, we can also
overparameterize this representation as

$$\log(\mu_{l(i,j,k)}) = \lambda + \lambda^{u}_{i} + \lambda^{v}_{j} + \lambda^{w}_{k} + \lambda^{uv}_{ij} + \lambda^{uw}_{ik} + \lambda^{vw}_{jk} + \lambda^{uvw}_{ijk}, \qquad (4.8)$$

where $\lambda$ appears as the overall or reference average effect, $\lambda^{u}_{i}$ appears as
the marginal discrepancy (against the reference effect $\lambda$) when $u = i$, $\lambda^{uv}_{ij}$
as the interaction discrepancy (against the added effects $\lambda + \lambda^{u}_{i} + \lambda^{v}_{j}$) when
$(u,v) = (i,j)$, etc.

Using the representation (4.8) is quite convenient because it allows a
straightforward parameterization of the nonsaturated models, which then appear
as submodels of (4.8) where some groups of parameters are null. For
example,

1. if both categorical variables $v$ and $w$ are irrelevant, then

$$\log(\mu_{l(i,j,k)}) = \lambda + \lambda^{u}_{i};$$

2. if all three categorical variables are mutually independent, then

$$\log(\mu_{l(i,j,k)}) = \lambda + \lambda^{u}_{i} + \lambda^{v}_{j} + \lambda^{w}_{k};$$

3. if $u$ and $v$ are associated but are both independent of $w$, then

$$\log(\mu_{l(i,j,k)}) = \lambda + \lambda^{u}_{i} + \lambda^{v}_{j} + \lambda^{w}_{k} + \lambda^{uv}_{ij};$$

4. if $u$ and $v$ are conditionally independent given $w$, then

$$\log(\mu_{l(i,j,k)}) = \lambda + \lambda^{u}_{i} + \lambda^{v}_{j} + \lambda^{w}_{k} + \lambda^{uw}_{ik} + \lambda^{vw}_{jk};$$ and

5. if there is no three-factor interaction, then

$$\log(\mu_{l(i,j,k)}) = \lambda + \lambda^{u}_{i} + \lambda^{v}_{j} + \lambda^{w}_{k} + \lambda^{uv}_{ij} + \lambda^{uw}_{ik} + \lambda^{vw}_{jk},$$

which appears as the most complete submodel (or as the global model if
the saturated model is not considered at all).

This representation naturally embeds log-linear modeling within a model
choice perspective in that it calls for a selection of the most parsimonious
submodel that remains compatible with the observations. This is clearly
equivalent to a variable-selection problem of a special kind in the sense that all
indicators related to the same association must remain or vanish at once.
This specific feature means that there are far fewer submodels to consider
than in a regular variable-selection problem.

As stressed above, the representation (4.8) is not identifiable. Although
the following is not strictly necessary from a Bayesian point of view (since
the Bayesian approach can handle nonidentifiable settings and still estimate
properly identifiable quantities), it is customary to impose identifiability
constraints on the parameters as in the ANOVA model. A common convention
is to set to zero the parameters corresponding to the first category of each
variable, which is equivalent to removing the indicator (or dummy variable)
of the first category for each variable (or group of variables). For instance,
for a $2 \times 2$ contingency table with two variables $u$ and $v$, both having two
categories, say 1 and 2, the constraint could be

$$\lambda^{u}_{1} = \lambda^{v}_{1} = \lambda^{uv}_{11} = \lambda^{uv}_{12} = \lambda^{uv}_{21} = 0.$$

For notational convenience, we assume below that $\beta$ is the vector of the
parameters once the identifiability constraint has been applied and that $X$ is the
indicator matrix with the corresponding columns removed.
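In R, an indicator matrix of this kind can be obtained from model.matrix, whose default treatment contrasts drop the indicator of the first category of each factor. The sketch below builds the matrix for a 2 x 2 x 5 table under the model with all two-factor interactions; the names cells, X, and X_indep are illustrative, and the ordering of the cells produced by expand.grid is an assumption that must match the ordering of the vector of counts.

```r
# one row per cell of a 2 x 2 x 5 table (u, v dichotomous; w coded 5 to 9)
cells=expand.grid(u=factor(1:2),v=factor(1:2),w=factor(5:9))

# all main effects and two-factor interactions, first categories removed
X=model.matrix(~(u+v+w)^2,data=cells)
dim(X)
colnames(X)

# submodel of mutual independence: main effects only
X_indep=model.matrix(~u+v+w,data=cells)
```

With this column ordering, the main effects occupy columns 1 to 7, the u:v interaction column 8, the u:w interactions columns 9 to 12, and the v:w interactions columns 13 to 16, so that removing one association amounts to dropping a block of columns.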

4.5.2  Inference  Under  a  Flat  Prior 

Even when using a noninformative flat prior on $\beta$, $\pi(\beta) \propto 1$, the posterior
distribution

$$\pi(\beta|y) \propto \prod_{i=1}^{n} \left\{\exp(x^{i\mathrm{T}}\beta)\right\}^{y_i} \exp\{-\exp(x^{i\mathrm{T}}\beta)\}$$

$$= \exp\left\{\sum_{i=1}^{n} y_i\,x^{i\mathrm{T}}\beta - \sum_{i=1}^{n} \exp(x^{i\mathrm{T}}\beta)\right\}$$

$$= \exp\left\{\left(\sum_{i=1}^{n} y_i\,x^{i}\right)^{\mathrm{T}}\beta - \sum_{i=1}^{n} \exp(x^{i\mathrm{T}}\beta)\right\}$$

is nonstandard and must be approximated by an MCMC algorithm. While
the shape of this density differs from the posterior densities in the probit and
logit cases, we can once more implement Algorithm 4.7 based on the normal
Fisher approximation of the likelihood (whose parameters are again derived
using the R glm() function as in

> mod=summary(glm(y~-1+X,family=poisson()))

which provides $\hat\beta$ as mod$coeff[,1] and $\hat\Sigma$ as mod$cov.unscaled).

For airquality, we first consider the most general nonsaturated model,
as described in Sect. 4.5.1. Taking into account the identifiability constraints,
there are therefore

$$1 + (2-1) + (2-1) + (5-1) + (2-1)\times(2-1) + (2-1)\times(5-1) + (2-1)\times(5-1),$$

i.e., 16, free parameters in the model (to be compared with the 20 counts
in the contingency table). Given the dimension of the simulated parameter,
it is impossible to provide a complete picture of the convergence properties
of the algorithm, and we represent in Fig. 4.8 the traces and histograms
for the marginal posterior distributions of the parameters based on 10,000
iterations using a scale factor equal to $\tau^2 = 0.5$. (This value was obtained by
trial and error, producing a smooth trace for all parameters. Larger values of
$\tau$ required a larger number of iterations since the acceptance rate was lower,
as the reader can check using the BCoRe package.) Note that some of the
traces represented in Fig. 4.8 show periodic patterns that indicate that more
iterations could be necessary. However, the corresponding histograms remain
quite stable over iterations. Both the approximated posterior means and the
posterior variances for the 16 parameters as deduced from the MCMC run are
given in Table 4.1. A few histograms in Fig. 4.8 are centered at 0, signaling a
potential lack of significance for the corresponding $\beta_i$'s.

Fig. 4.8. Dataset airquality: Traces (top) and histograms (bottom) of the simulations
from the posterior distributions of the components of $\beta$ using a flat prior and a
random walk Metropolis-Hastings algorithm with scale factor $\tau^2 = 0.5$ (same order
row-wise as in Table 4.1)

Table 4.1. Dataset airquality: Bayes estimates of the parameter $\beta$ using a random
walk MH algorithm with scale factor $\tau^2 = 0.5$

Effect                 Post. mean   Post. var.
$\lambda$                 2.8041      0.0612
$\lambda^{u}_{2}$        -1.0684      0.2176
$\lambda^{v}_{2}$        -5.8652      1.7141
$\lambda^{w}_{2}$        -1.4401      0.2735
$\lambda^{w}_{3}$        -2.7178      0.7915
$\lambda^{w}_{4}$        -1.1031      0.2295
$\lambda^{w}_{5}$        -0.0036      0.1127
$\lambda^{uv}_{22}$       3.3559      0.4490
$\lambda^{uw}_{22}$      -1.6242      1.2869
$\lambda^{uw}_{23}$      -0.3456      0.8432
$\lambda^{uw}_{24}$      -0.2473      0.6658
$\lambda^{uw}_{25}$      -1.3335      0.7115
$\lambda^{vw}_{22}$       4.5493      2.1997
$\lambda^{vw}_{23}$       6.8479      2.5881
$\lambda^{vw}_{24}$       4.6557      1.7201
$\lambda^{vw}_{25}$       3.9558      1.7128


4.5.3  Model  Choice  and  Significance  of  the  Parameters 


If we try to compare different levels of association (or interaction), or if we
simply want to test the significance of some parameters $\beta_i$, the flat prior is
once again inappropriate. The G-prior alternative proposed for the probit and
logit models is still available, though, and we can thus replace the posterior
distribution of the previous section with

$$\pi(\beta|y) \propto |X^{\mathrm{T}}X|^{1/2}\,\Gamma(k/2)\,\left(\beta^{\mathrm{T}}(X^{\mathrm{T}}X)\beta\right)^{-k/2}\pi^{-k/2} \exp\left\{\left(\sum_{i=1}^{n} y_i\,x^{i}\right)^{\mathrm{T}}\beta - \sum_{i=1}^{n}\exp(x^{i\mathrm{T}}\beta)\right\} \qquad (4.9)$$


as an alternative posterior.

Table 4.2. Dataset airquality: Metropolis-Hastings approximations of the posterior
means under the G-prior

Effect                 Post. mean   Post. var.
$\lambda$                 2.7202      0.0603
$\lambda^{u}_{2}$        -1.1237      0.1981
$\lambda^{v}_{2}$        -4.5393      0.9336
$\lambda^{w}_{2}$        -1.4245      0.3164
$\lambda^{w}_{3}$        -2.5970      0.5596
$\lambda^{w}_{4}$        -1.1373      0.2301
$\lambda^{w}_{5}$         0.0359      0.1166
$\lambda^{uv}_{22}$       2.8902      0.3221
$\lambda^{uw}_{22}$      -0.9385      0.8804
$\lambda^{uw}_{23}$       0.1942      0.6055
$\lambda^{uw}_{24}$       0.0589      0.5345
$\lambda^{uw}_{25}$      -1.0534      0.5220
$\lambda^{vw}_{22}$       3.2351      1.3664
$\lambda^{vw}_{23}$       5.3978      1.3506
$\lambda^{vw}_{24}$       3.5831      1.0452
$\lambda^{vw}_{25}$       2.8051      1.0061

For airquality and the same model as in the previous analysis, namely the
maximum nonsaturated model with 16 parameters, Algorithm 4.7 can be used
with (4.9) as target and $\tau^2 = 0.5$ as the scale in the random walk. The result
of this simulation over 10,000 iterations is presented in Fig. 4.9. The traces of
the components of $\beta$ show the same slow mixing as in Fig. 4.8, with similar
occurrences of large deviations from the mean value that may indicate the weak
identifiability of some of these parameters. Note also that the histograms of the
posterior marginal distributions are rather close to those associated with the
flat prior, as shown in Fig. 4.8. The MCMC approximations to the posterior
means and the posterior variances are given in Table 4.2 for all 16 parameters,
based on the last 9,000 iterations. While the first parameters are quite close to
those provided by Table 4.1, the estimates of the interaction coefficients vary
much more and are associated with much larger variances. This indicates
that much less information is available within the contingency table about
interactions, as can be expected.


If we now consider the very reason why this alternative to the flat prior
was introduced, we are facing the same difficulty as in the probit case for
the computation of the marginal density of $y$. And, once again, the same
solution applies: using an importance sampling experiment to approximate
the integral works when the importance function is a multivariate normal
(or $t$) distribution with mean (approximately) $\mathbb{E}[\beta|y]$ and covariance matrix
(approximately) $2 \times \mathbb{V}(\beta|y)$ using the Metropolis-Hastings approximations
reported in Table 4.2. We can therefore approximate Bayes factors for testing
all possible structures of the log-linear model.

Fig. 4.9. Dataset airquality: Same legend as Fig. 4.8 for the posterior distribution
(4.9) as target

For airquality, we illustrate this ability by testing the presence of two-by-two
interactions between the three variables. We thus compare the largest
non-saturated model with each submodel where one interaction is removed.
An ANOVA-like output is

Effect   log10(BF)
u:v       6.0983 (****)
u:w      -0.5732
v:w       6.0802 (****)

evidence against H0: (****) decisive, (***) strong,
(**) substantial, (*) poor


which means that the interaction between $u$ and $w$ (that is, ozone and month)
is too small to be significant given all the other effects. (Note that it would
be excessive to derive from this lack of significance a conclusion of independence
between $u$ and $w$ because this interaction is conditional on all other
interactions in the complete nonsaturated model.)
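The importance sampling approximation of a marginal density can be illustrated on a toy case where the exact answer is available. The sketch below takes a single Poisson mean $\exp(\beta)$ with a flat prior on $\beta$, for which the integral of the likelihood equals $\Gamma(S)/(n^S \prod_i y_i!)$ with $S = \sum_i y_i$. As assumptions of the sketch, the importance function is a Student $t$ distribution with 3 degrees of freedom, and the posterior mode and twice the curvature-based variance stand in for the MCMC mean and variance used in the text; all object names are illustrative.

```r
set.seed(3)
y=rpois(20,4)                 # simulated counts
n=length(y); S=sum(y)

# log of likelihood times flat prior, vectorized in b
lpost=function(b) S*b-n*exp(b)-sum(lfactorial(y))

m=log(S/n)                    # posterior mode
s=sqrt(2/S)                   # scale: variance 1/S at the mode, doubled
N=1e5
sim=m+s*rt(N,df=3)            # draws from the t importance function

# log importance weights: target minus log density of the importance function
lw=lpost(sim)-(dt((sim-m)/s,df=3,log=TRUE)-log(s))

# log of the average weight, computed stably on the log scale
log_marg_is=max(lw)+log(mean(exp(lw-max(lw))))

# closed-form value for comparison
log_marg_exact=lgamma(S)-S*log(n)-sum(lfactorial(y))
```

The heavy tails of the $t$ importance function keep the weights bounded in this example, which is the usual motivation for preferring it to a normal importance function; a Bayes factor is then the ratio of two such approximations, one per model.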

The ANOVA-like output above was obtained by the following R code: first we simulated an
importance sample towards approximating the full model integrated likelihood

mklog=apply(noinfloglin,2,mean)
vklog=var(noinfloglin)
simk=rmnorm(100000,mklog,2*vklog)
usk=loglinnoinflpost(simk,counts,X)-
  dmnorm(simk,mklog,2*vklog,log=T)

then reproduced this computation for the three corresponding submodels,
namely

noinfloglin1=hmnoinfloglin(10^4,counts,X[,-8],0.5)
mk1=apply(noinfloglin1,2,mean)
vk1=var(noinfloglin1)
simk1=rmnorm(100000,mk1,2*vk1)
usk1=loglinnoinflpost(simk1,counts,X[,-8])-
  dmnorm(simk1,mk1,2*vk1,log=T)
bf1loglin=mean(exp(usk))/mean(exp(usk1))

and the same pattern with

noinfloglin2=hmnoinfloglin(10^4,counts,cbind(X[,-(9:12)]),0.5)

and

noinfloglin3=hmnoinfloglin(10^4,counts,X[,1:12],0.5)

4.6 Exercises


4.1 Show that, for the logistic regression model, the statistic $\sum_{i=1}^{n} y_i\,x^{i}$ is
sufficient when conditioning on the $x^i$'s ($1 \le i \le n$), and give the corresponding
family of conjugate priors.

4.2 Show that the logarithmic link is the canonical link function in the case of
the Poisson regression model.

4.3 Suppose $y_1,\ldots,y_k$ are independent Poisson $\mathscr{P}(\mu_i)$ random variables. Show
that, conditional on $n = \sum_{i=1}^{k} y_i$,

$$y = (y_1,\ldots,y_k) \sim \mathscr{M}_k(n;\alpha_1,\ldots,\alpha_k),$$

and determine the $\alpha_i$'s.


4.4 For $\pi$ the density of an inverse normal distribution with parameters $\theta_1 = 3/2$
and $\theta_2 = 2$,

$$\pi(x) \propto x^{-3/2}\exp(-3/2\,x - 2/x)\,\mathbb{I}_{x>0},$$

write down and implement an independence MH sampler with a Gamma proposal
with parameters $(\alpha,\beta) = (4/3,1)$ and $(\alpha,\beta) = (0.5\sqrt{4/3}, 0.5)$.

4.5 Consider $x_1$, $x_2$, and $x_3$ iid $\mathscr{C}(\theta,1)$, and $\pi(\theta) \propto \exp(-\theta^2/100)$. Show that
the posterior distribution of $\theta$, $\pi(\theta|x_1,x_2,x_3)$, is proportional to

$$\exp(-\theta^2/100)\left[(1+(\theta-x_1)^2)(1+(\theta-x_2)^2)(1+(\theta-x_3)^2)\right]^{-1}$$

and that it is trimodal when $x_1 = 0$, $x_2 = 5$, and $x_3 = 9$. Using a random walk
based on the Cauchy distribution $\mathscr{C}(0,\sigma^2)$, estimate the posterior mean of $\theta$ using
different values of $\sigma^2$. In each case, monitor the convergence.

4.6 Estimate the mean of a $\mathscr{G}a(4.3, 6.2)$ random variable using

1. direct sampling from the distribution via the R command

> x=rgamma(n,4.3,scale=6.2)

2. Metropolis-Hastings with a $\mathscr{G}a(4,7)$ proposal distribution;

3. Metropolis-Hastings with a $\mathscr{G}a(5,6)$ proposal distribution.

In each case, monitor the convergence of the cumulated average.

4.7 For a standard normal distribution as target, implement a Hastings-Metropolis
algorithm with a mixture of five random walks with variances
$\sigma^2 = 0.01, 0.1, 1, 10, 100$ and equal weights. Compare its output with the output
of Fig. 4.3.

4.8 For the probit model under flat prior, find conditions on the observed pairs
$(x^i, y_i)$ for the posterior distribution above to be proper.

4.9 For the probit model under non-informative prior, find conditions on $\sum_i y_i$
and $\sum_i (1-y_i)$ for the posterior distribution defined by (4.4) to be proper.

4.10 Include an intercept in the probit analysis of bank and run the corresponding
version of Algorithm 4.7 to discuss whether or not the posterior variance of
the intercept is high.

4.11 Using the latent variable representation of the probit model, introduce
$z_i|\beta \sim \mathscr{N}(x^{i\mathrm{T}}\beta, 1)$ ($1 \le i \le n$) such that $y_i = \mathbb{I}_{z_i > 0}$. Deduce that

$$z_i|y_i,\beta \sim \begin{cases} \mathscr{N}_{+}(x^{i\mathrm{T}}\beta, 1, 0) & \text{if } y_i = 1,\\ \mathscr{N}_{-}(x^{i\mathrm{T}}\beta, 1, 0) & \text{if } y_i = 0,\end{cases}$$

where $\mathscr{N}_{+}(\mu,1,0)$ and $\mathscr{N}_{-}(\mu,1,0)$ are the normal distributions with mean $\mu$ and
variance 1 that are left-truncated and right-truncated at 0, respectively. Check
that those distributions can be simulated using the R commands

> xp=qnorm(runif(1)*pnorm(mu)+pnorm(-mu))+mu

> xm=qnorm(runif(1)*pnorm(-mu))+mu

Under the flat prior $\pi(\beta) \propto 1$, show that

$$\beta|y,z \sim \mathscr{N}_k\left((X^{\mathrm{T}}X)^{-1}X^{\mathrm{T}}z, (X^{\mathrm{T}}X)^{-1}\right),$$

where $z = (z_1,\ldots,z_n)$, and derive the corresponding Gibbs sampler, sometimes
called the Albert-Chib sampler. (Hint: A good starting point is the maximum
likelihood estimate of $\beta$.) Compare the application to bank with the output in
Fig. 4.4. (Note: Account for differences in computing time.)

4.12 For the bank dataset and the probit model, compute the Bayes factor
associated with the null hypothesis $H_0: \beta_2 = \beta_3 = 0$.

4.13 In the case of the logit model, i.e., when $p_i = \exp(x^{i\mathrm{T}}\beta)/\{1+\exp(x^{i\mathrm{T}}\beta)\}$
($1 \le i \le k$), derive the prior distribution on $\beta$ associated with the prior (4.6) on
$(p_1,\ldots,p_k)$.

4.14 Examine whether or not the sufficient conditions for propriety of the posterior
distribution found in Exercise 4.9 for the probit model are the same for the
logit model.

4.15 For the bank dataset and the logit model, compute the Bayes factor associated
with the null hypothesis $H_0: \beta_2 = \beta_3 = 0$ and compare its value with
the value obtained for the probit model in Exercise 4.12.

4.16  Given  a  contingency  table  with  four  categorical  variables,  determine  the 
number  of  submodels  to  consider. 

4.17 In the case of a $2 \times 2$ contingency table with fixed total count $n =
n_{11} + n_{12} + n_{21} + n_{22}$, we denote by $\theta_{11}, \theta_{12}, \theta_{21}, \theta_{22}$ the corresponding
probabilities. If the prior on those probabilities is a Dirichlet $\mathscr{D}_4(1/2,\ldots,1/2)$, give the
corresponding marginal distributions of $\alpha = \theta_{11} + \theta_{12}$ and $\beta = \theta_{11} + \theta_{21}$. Deduce
the associated Bayes factor if $H_0$ is the hypothesis of independence between the
factors and if the priors on the margin probabilities $\alpha$ and $\beta$ are those derived
above.

he hadn't landed in a trap.

Ian Rankin, Resurrection Men.


Roadmap 

This  chapter  deals  with  a  very  special  case  of  survey  models.  Surveys  are  used 
in  many  settings  to  evaluate  some  features  of  a  given  population,  including  its 
main  characteristic,  the  size  of  the  population.  In  the  case  of  capture-recapture 
surveys,  individuals  are  observed  and  identified  either  once  or  several  times  and  the 
repeated  observations  can  be  used  to  draw  inference  on  the  population  size  and 
its  dynamic  characteristics.  Along  with  the  original  model,  we  will  also  introduce 
extensions  that  can  be  seen  as  a  first  entry  into  hidden  Markov  chain  models, 
detailed  further  in  Chap.  6.  In  particular,  we  cover  the  generic  Arnason-Schwarz 
model  that  is  customarily  used  by  biologists  for  open  populations. 

On the methodological side, we provide an introduction to the accept-reject
method, which is the central simulation technique behind most standard random
generators and relates to the Metropolis-Hastings methodology in many ways.

5.1 Inference in a Finite Population

In this chapter, we consider the problem of estimating the unknown size, $N$,
of a population, based on a survey; that is, on a partial observation of this
population. To be able to evaluate a population size without going through
the enumeration of all its members is obviously very appealing, both timewise
and moneywise, especially when sampling those members has a perturbing
effect on them.^1

A  primary  type  of  survey  (which  we  do  not  study  in  this  chapter)  is  based 
on  knowledge  of  the  structure  of  the  population.  For  instance,  in  a  political 
survey  about  voting  intentions,  we  build  a  sample  of  1,000  individuals,  say, 
such  that  the  main  sociological  groups  (farmers,  civil  servants,  senior  citizens, 
etc.)  are  represented  in  proportion  in  the  sample.  In  that  situation,  there  is  no 
statistical  inference,  so  to  speak,  except  about  the  variability  of  the  responses, 
which  are  in  the  simplest  cases  binomial  variables. 

Obviously, such surveys require primary knowledge of the population,
which can be obtained either by a (costly) census, like those that states run
every 5 or 10 years, or by a preliminary exploratory survey that aims at uncovering
these hidden structures. This secondary type of survey is the purpose of
this chapter, under the name of capture-recapture (or capture-mark-recapture)
experiments, where a few individuals sampled at random from the population
of interest bring some information about the characteristics of this population
and in particular about its size.

The capture-recapture models were first used in biology and ecology to
estimate the size of animal populations, such as herds of caribou (e.g., for
culling) or of whales (e.g., for the International Whaling Commission to
determine fishing quotas), cod populations, and the number of different species
in a particular area. While our illustrative dataset will be related to a biological
problem, we stress that these capture-recapture models apply in a much
wider range of domains, such as, for instance,

sociology and demography, where the estimation of the size of populations
at risk is always delicate (e.g., homeless people, prostitutes, illegal
migrants, drug addicts, etc.);

official statistics for reducing the cost of a census^2 or improving its
efficiency on delicate or rare subcategories (as in the U.S. census undercount
procedure and the new French census);

finance (e.g., in credit scoring, defaulting companies, etc.) and marketing
(consumer habits, telemarketing, etc.);

fraud detection (e.g., phone, credit card, etc.) and document authentication
(historical documents, forgery, etc.); and

software debugging, to determine an evaluation of the number of bugs in
a computer program.

1 In the most extreme cases, sampling an individual may lead to its destruction,
as for instance in forestry when estimating the volume of trees or in meat production
when estimating the content of fat in meat.

2 Even though a census is formally a deterministic process since it aims at the
complete enumeration of a given population, it inevitably involves many random
components at the selection, collection, and processing levels (Sarndal et al., 2003).

In these different examples, the size $N$ of the whole population is unknown
but samples (with fixed or random sizes) can easily be extracted from the
population. For instance, in a computer program, the total number $N$ of
bugs is unknown but one can record the number $n_1$ of bugs detected in a
given perusal. Similarly, the total number $N$ of homeless people in a city like
Philadelphia at a given time is not known but it is possible to count the
number $n_1$ of homeless persons in a given shelter on a precise night, to record
their ID, and to cross this sample with a sample of $n_2$ persons collected the
night after in order to detect how many persons $n_{12}$ were present in the
shelter on both nights.

The  dataset  we  consider  throughout  this  chapter  is  called  eurodip  and 
is  related  to  a  population  of  birds  called  European  dippers  (Cinclus  cinclus). 
These birds are closely dependent on streams, feeding on underwater
invertebrates, and their nests are always close to water. The capture-recapture
data on the European dipper contained in eurodip covers 7 years (1981-1987
inclusive) of observations in a zone of 200 km² in eastern France. The data
consist of markings and recaptures of breeding adults each year during the
breeding period from early March to early June. Birds were at least 1 year old
when initially banded. In eurodip, each row of seven digits corresponds to a
capture-recapture story for a given dipper, 0 indicating an absence of capture
that year and, in the case of a capture, 1, 2, or 3 representing the zone where
the dipper is captured. For instance, the three lines from eurodip

1 0 0 0 0 0 0
1 3 0 0 0 0 0
0 2 2 2 1 2 2

indicate that the first dipper was only captured the first year in zone 1 and
that the second dipper was captured in years 1981 and 1982 and moved from
zone 1 to zone 3 between those years. The third dipper was captured every
year but 1981 and moved between zones 1 and 2 during the remaining years.
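
As a reading aid, the three capture histories quoted above can be summarized in a few lines of R. This is only a sketch: the matrix `hist3` is typed in by hand from the text, and the names `captured`, `n_captures`, `first_year`, and `zones` are illustrative choices, not objects from the dataset or its package.

```r
# Hypothetical helper objects: summarize the three example capture histories.
hist3 <- rbind(c(1, 0, 0, 0, 0, 0, 0),
               c(1, 3, 0, 0, 0, 0, 0),
               c(0, 2, 2, 2, 1, 2, 2))
colnames(hist3) <- 1981:1987

captured <- hist3 > 0                  # TRUE when the bird was caught that year
n_captures <- rowSums(captured)        # number of years each bird was caught
# year of first capture, read off the column names
first_year <- apply(captured, 1,
                    function(h) as.integer(names(h)[which(h)[1]]))
# zones visited by each bird, ignoring the years without capture
zones <- lapply(seq_len(nrow(hist3)),
                function(i) unique(hist3[i, captured[i, ]]))
```

The same `> 0` indicator is what the later computations of the capture counts rely on, since only the presence of a capture matters for the closed-population models, not the zone.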


In conclusion, we hope that the introduction above was motivating enough
to convince the reader that population sampling models are deeply relevant
in statistical practice. Besides, these models also provide an interesting
application of Bayesian modeling and in particular they allow for the inclusion
of often available prior information.

5.2 Sampling Models


5.2.1  The  Binomial  Capture  Model 


We start with the simplest model of all, namely the independent observation
or capture³ of $n^+$ individuals from a population of size $N$. For instance, a
trap is positioned on a rabbit track for five hours and $n^+$ rabbits are found in
the trap. While the population size $N \in \mathbb{N}$ is the parameter of
interest, there exists a nuisance parameter, namely the probability
$p \in [0,1]$ with which each individual is captured. (This model assumes that
catching the $i$th individual is independent of catching the $j$th individual.)
For this model,

$$ n^+ \sim \mathcal{B}(N,p) $$

and the corresponding likelihood is

$$ \ell(N,p|n^+) = \binom{N}{n^+}\, p^{n^+} (1-p)^{N-n^+}\, \mathbb{I}_{N \ge n^+}. $$

Obviously, with a single observation $n^+$, we cannot say much about $(N,p)$, but
the posterior distribution is still well-defined. For instance, if we use the
vague prior

$$ \pi(N,p) \propto N^{-1}\, \mathbb{I}_{\mathbb{N}^*}(N)\, \mathbb{I}_{[0,1]}(p), $$

the posterior distribution of $N$ is

$$
\begin{aligned}
\pi(N|n^+) &\propto \frac{N!}{(N-n^+)!}\, N^{-1}\, \mathbb{I}_{N\ge n^+}\, \mathbb{I}_{\mathbb{N}^*}(N) \int_0^1 p^{n^+}(1-p)^{N-n^+}\,\mathrm{d}p \\
&\propto \frac{(N-1)!}{(N-n^+)!}\, \frac{(N-n^+)!}{(N+1)!}\, \mathbb{I}_{N\ge n^+\vee 1} \\
&= \frac{1}{N(N+1)}\, \mathbb{I}_{N\ge n^+\vee 1}, \qquad (5.1)
\end{aligned}
$$

where $n^+ \vee 1 = \max(n^+,1)$. Note that this posterior distribution is defined
even when $n^+ = 0$. If we use the (more informative) uniform prior

$$ \pi(N,p) \propto \mathbb{I}_{\{1,\ldots,S\}}(N)\, \mathbb{I}_{[0,1]}(p), $$

the posterior distribution of $N$ is

$$ \pi(N|n^+) \propto \frac{1}{N+1}\, \mathbb{I}_{\{n^+\vee 1,\ldots,S\}}(N). $$

³ We use the original terminology of capture and individuals, even though the
sampling mechanism may be far from genuine capture, as in whale sightseeing or
software bug detection.

For illustrative purposes, consider the case of year 1981 in eurodip (which
is the first column in the file):

> library(bayess)  # package that provides the eurodip dataset
> data(eurodip)
> year81=eurodip[,1]
> nplus=sum(year81>0)
> nplus
[1] 22

where $n^+ = 22$ dippers were thus captured. By using the binomial capture
model and the vague prior $\pi(N,p) \propto N^{-1}$, the number of dippers $N$ can
be estimated by the posterior median. (Note that the mean of (5.1) does not
exist, no matter what $n^+$ is.)


> N=max(nplus,1)
> rangd=N:(10^4*N)
> post=1/(rangd*(rangd+1))
> 1/sum(post)
[1] 22.0022
> post=post/sum(post)
> min(rangd[cumsum(post)>.5])
[1] 43

For this year 1981, the estimate of $N$ is therefore 43 dippers. (See Exercise
5.1 for theoretical justifications as to why the inverse of the sum of the
unnormalized probabilities is equal to $n^+$ and why the median is exactly
$2n^+ - 1$.) If we use the ecological information that there cannot be more than
400 dippers in this region, we can take the prior
$\pi(N,p) \propto \mathbb{I}_{\{1,\ldots,400\}}(N)\, \mathbb{I}_{[0,1]}(p)$ and
estimate the number of dippers $N$ by its posterior expectation:

> pbino=function(nplus){
+   prob=c(rep(0,max(nplus,1)-1),1/(max(nplus,1):400+1))
+   prob/sum(prob)
+ }
> sum((1:400)*pbino(nplus))
[1] 130.5237
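
The two numerical facts observed under the vague prior (a normalizing sum close to $1/n^+$ and a median of 43) can be traced to a telescoping sum. The lines below are a short derivation offered as a reading aid rather than a full solution of Exercise 5.1; $n$ is shorthand for $n^+ \vee 1$, introduced here for convenience.

$$
\sum_{N=n}^{\infty} \frac{1}{N(N+1)}
= \sum_{N=n}^{\infty} \left( \frac{1}{N} - \frac{1}{N+1} \right)
= \frac{1}{n},
\qquad
P(N \le M \mid n^+) = n \left( \frac{1}{n} - \frac{1}{M+1} \right) = 1 - \frac{n}{M+1}.
$$

Hence $P(N \le M \mid n^+) \ge 1/2$ exactly when $M \ge 2n - 1$, which gives the median $2 \times 22 - 1 = 43$, and $N\,\pi(N|n^+) = n/(N+1)$ is not summable, which is why the posterior mean does not exist. The value 22.0022 printed above differs slightly from 22 only because the sum is truncated at $10^4 n^+$.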


5.2.2 The Two-Stage Capture-Recapture Model

A logical extension to the capture model above is the capture-mark-recapture
model, which considers two capture periods plus a marking stage, as follows:

1. $n_1$ individuals from a population of size $N$ are “captured”, that is,
   sampled without replacement.

2. Those individuals are “marked”, that is, identified by a numbered tag
   (for birds and fishes), a collar (for mammals), or another device (like the
   Social Security number for homeless people or a picture for whales), and
   they are then released into the population.

3. A second and similar sampling (once again without replacement) is
   conducted, with $n_2$ individuals captured.

4. $m_2$ individuals out of the $n_2$'s bear the identification mark and are thus
   characterized as having been captured in both experiments.

If we assume a closed population (that is, a fixed population size $N$
throughout the capture experiment), a constant capture probability $p$ for all
individuals, and complete independence between individuals and between
captures, we end up with a product of binomial models,

$$ n_1 \sim \mathcal{B}(N,p), \qquad m_2|n_1 \sim \mathcal{B}(n_1,p), \qquad\text{and}\qquad n_2 - m_2|n_1,m_2 \sim \mathcal{B}(N-n_1,p). $$

If

$$ n^c = n_1 + n_2 \quad\text{and}\quad n^+ = n_1 + (n_2 - m_2) $$

denote the total number of captures over both periods and the total number
of captured individuals, respectively, the corresponding likelihood
$\ell(N,p|n_1,n_2,m_2)$ is

$$
\begin{aligned}
&\binom{N-n_1}{n_2-m_2}\, p^{n_2-m_2}(1-p)^{N-n_1-n_2+m_2}\, \mathbb{I}_{\{0,\ldots,N-n_1\}}(n_2-m_2) \\
&\qquad \times \binom{n_1}{m_2}\, p^{m_2}(1-p)^{n_1-m_2}\, \binom{N}{n_1}\, p^{n_1}(1-p)^{N-n_1}\, \mathbb{I}_{\{0,\ldots,N\}}(n_1) \\
&\propto \frac{N!}{(N-n_1-n_2+m_2)!}\, p^{n_1+n_2}(1-p)^{2N-n_1-n_2}\, \mathbb{I}_{N\ge n^+} \\
&\propto \frac{N!}{(N-n^+)!}\, p^{n^c}(1-p)^{2N-n^c}\, \mathbb{I}_{N\ge n^+},
\end{aligned}
$$

which shows that $(n^c,n^+)$ is a sufficient statistic. If we choose the prior
$\pi(N,p) = \pi(N)\pi(p)$ such that $\pi(p)$ is a $\mathcal{U}([0,1])$ density, the
conditional posterior distribution on $p$ is such that

$$ \pi(p|N,n_1,n_2,m_2) = \pi(p|N,n^c) \propto p^{n^c}(1-p)^{2N-n^c}; $$

that is,

$$ p|N,n^c \sim \mathcal{B}e(n^c+1,\, 2N-n^c+1). $$
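
The product of binomial models above is easy to simulate, which can help build intuition for the sufficient statistics $(n^c, n^+)$. The following is a minimal sketch under made-up values: `N_true` and `p_true` are illustrative assumptions, not estimates from the dipper data, and `new2` is a hypothetical name for $n_2 - m_2$.

```r
# Minimal simulation sketch of the two-stage model (illustrative values only).
set.seed(42)
N_true <- 150
p_true <- 0.3

n1 <- rbinom(1, N_true, p_true)          # first capture: n1 ~ B(N, p)
m2 <- rbinom(1, n1, p_true)              # marked individuals recaptured
new2 <- rbinom(1, N_true - n1, p_true)   # unmarked individuals caught at stage 2
n2 <- m2 + new2

nc <- n1 + n2                            # total number of captures
nplus <- n1 + n2 - m2                    # number of distinct individuals captured

# One draw of p from its conditional posterior given N (here the true N)
p_draw <- rbeta(1, nc + 1, 2 * N_true - nc + 1)
```

Repeating the simulation for several values of `p_true` shows how quickly the number of recaptures `m2` shrinks when the capture probability is low, which is the situation where inference on $N$ becomes difficult.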


Unfortunately, the marginal posterior distribution of $N$ is more complicated.
For instance, if $\pi(N) = \mathbb{I}_{\mathbb{N}^*}(N)$, it satisfies

$$ \pi(N|n_1,n_2,m_2) = \pi(N|n^c,n^+) \propto \binom{N}{n^+}\, B(n^c+1,\, 2N-n^c+1)\, \mathbb{I}_{N\ge n^+\vee 1}, $$

where $B(a,b)$ denotes the beta function. This distribution is called a
beta-Pascal distribution, but it is not very tractable. The same difficulty
occurs if $\pi(N) = N^{-1}\,\mathbb{I}_{\mathbb{N}^*}(N)$.

The intractability in the posterior distribution $\pi(N|n_1,n_2,m_2)$ is due to
the infinite summation resulting from the unbounded support of $N$. A feasible
approximation is to replace the missing normalizing factor by a finite sum
with a large enough bound on $N$, the bound being determined by a lack
of perceivable impact on the sum. But the approximation errors due to the
computations of terms such as $\binom{N}{n^+}$ or $B(n^c+1, 2N-n^c+1)$ can become a
serious problem when $n^+$ is large. However,

> prob=lchoose((471570:10^7),471570)+lgamma(2*(471570:10^7)-
+   582681+1)-lgamma(2*(471570:10^7)+2)
> range(prob)
[1] -7886469 -7659979


shows  that  relatively  large  populations  are  manageable. 
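
The finite-sum approximation of the normalizing factor can be sketched as follows, working on the log scale throughout. This is an illustration under stated assumptions: it uses the flat prior on $N$, the values $n^+ = 71$ and $n^c = 82$ reported further down for the first two years of eurodip, and the hypothetical helper names `log_weight` and `log_norm_const`.

```r
# Sketch: truncate the support of N at an increasing bound and watch the
# (log) normalizing constant stabilize.
nplus <- 71; nc <- 82   # eurodip values for the first two years

# log of choose(N, n+) * B(n^c + 1, 2N - n^c + 1)
log_weight <- function(N) lchoose(N, nplus) + lbeta(nc + 1, 2 * N - nc + 1)

log_norm_const <- function(bound) {
  lw <- log_weight(max(nplus, 1):bound)
  max(lw) + log(sum(exp(lw - max(lw))))   # log-sum-exp, avoids underflow
}

bounds <- c(500, 1000, 5000, 10000)
vals <- sapply(bounds, log_norm_const)
vals
```

When successive entries of `vals` agree to the displayed precision, the bound has no perceivable impact on the sum, which is the stopping criterion described in the text.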

If we have information about an upper bound $S$ on $N$ and use the
corresponding uniform prior,

$$ \pi(N) \propto \mathbb{I}_{\{1,\ldots,S\}}(N), $$

the posterior distribution of $N$ is thus proportional to

$$ \binom{N}{n^+}\, \frac{\Gamma(2N-n^c+1)}{\Gamma(2N+2)}\, \mathbb{I}_{\{n^+\vee 1,\ldots,S\}}(N), $$

and, in this case, it is possible to calculate the posterior expectation of $N$
with no approximation error.

For the first 2 years of the eurodip experiment, which correspond to the
first two columns and the first 70 rows of the dataset, $n_1 = 22$, $n_2 = 60$,
and $m_2 = 11$. Hence, $n^c = 82$ and $n^+ = 71$. Therefore, within the frame of
the two-stage capture-recapture model and the uniform prior
$\mathcal{U}(\{1,\ldots,400\}) \times \mathcal{U}([0,1])$ on $(N,p)$, the posterior
expectation of $N$ is derived as follows:

> n1=sum(eurodip[,1]>0)
> n2=sum(eurodip[,2]>0)
> m2=sum((eurodip[,1]>0) & (eurodip[,2]>0))
> nc=n1+n2
> nplus=nc-m2
> pcapture=function(T,nplus,nc){
+   #T is the number of capture episodes
+   lprob=lchoose(max(nplus,1):400,nplus)+
+     lgamma(T*max(nplus,1):400-nc+1)-
+     lgamma(T*max(nplus,1):400+2)
+   prob=c(rep(0,max(nplus,1)-1),exp(lprob-max(lprob)))
+   prob/sum(prob)
+ }
> sum((1:400)*pcapture(2,nplus,nc))
[1] 165.2637

⁴ This analysis is based on the assumption that all birds captured in the second
year were already present in the population during the first year.


A simpler model used in capture-recapture settings is the hypergeometric
model, also called the Darroch model. This model can be seen as a conditional
version of the two-stage model when conditioning on both sample sizes $n_1$ and
$n_2$ since (see Exercise 5.3)

$$ m_2|n_1,n_2 \sim \mathcal{H}(N, n_2, n_1/N), $$

the hypergeometric distribution. If we choose the uniform prior
$\mathcal{U}(\{1,\ldots,400\})$ on $N$, the posterior distribution of $N$ is thus

$$ \pi(N|m_2) \propto \binom{N-n_1}{n_2-m_2} \bigg/ \binom{N}{n_2}\;\; \mathbb{I}_{\{n^+\vee 1,\ldots,400\}}(N), $$

and posterior expectations can be computed numerically by simple summations.
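
The Darroch posterior can also be written with R's built-in hypergeometric mass function, which avoids handling the binomial coefficients by hand. The function below is a sketch, not the book's own code: `darroch_post` and its argument `S` (the upper bound of the uniform prior) are hypothetical names introduced for illustration.

```r
# Sketch: Darroch posterior of N under a uniform prior on {1, ..., S}.
darroch_post <- function(n1, n2, m2, S = 400) {
  nplus <- max(n1 + n2 - m2, 1)   # smallest population size compatible with data
  w <- numeric(S)                 # zero weight below n+
  N <- nplus:S
  # dhyper(x, m, n, k): x marked among k drawn, with m marked and n unmarked
  w[N] <- dhyper(m2, n1, N - n1, n2)
  w / sum(w)
}

post <- darroch_post(n1 = 22, n2 = 60, m2 = 11)
sum((1:400) * post)               # posterior expectation of N
```

Leaving `S` as an argument makes it straightforward to study the sensitivity of the posterior expectation to the prior bound, simply by calling the function over a vector of bounds.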


For the first 2 years of the eurodip dataset and $S = 400$, the posterior
distribution of $N$ for the Darroch model is given by

$$ \pi(N|m_2) \propto \frac{(N-n_1)!\,(N-n_2)!}{(N-n_1-n_2+m_2)!\,N!}\;\; \mathbb{I}_{\{71,\ldots,400\}}(N), $$

the normalization factor being the inverse of

$$ \sum_{k=71}^{400} \frac{(k-n_1)!\,(k-n_2)!}{(k-n_1-n_2+m_2)!\,k!}. $$

We thus have a closed-form posterior distribution and the posterior
expectation of $N$ is given by

> pdarroch=function(n1,n2,m2){
+   prob=c(rep(0,max(n1+n2-m2,1)-1),
+     choose(n1,m2)*choose(max((n1+n2-m2),1):400-n1,n2-m2)/
+     choose(max((n1+n2-m2),1):400,n2))
+   prob/sum(prob)
+ }
> sum((1:400)*pdarroch(n1,n2,m2))
[1] 137.5962

Table 5.1 shows the evolution of this posterior expectation for different
values of $m_2$, obtained by


> for (i in 6:16) print(round(sum(pdarroch(n1,n2,i)*1:400)))
[1] 277
[1] 252
[1] 224
[1] 197
[1] 172
[1] 152
[1] 135
[1] 122
[1] 111
[1] 101
[1] 94

The number of recaptures is thus highly influential on the estimate of $N$. In
parallel, Table 5.2 shows the evolution of the posterior expectation for
different values of $S$ (taken equal to 400 in the above). When $S$ is large
enough, say larger than $S = 250$, the estimate of $N$ is quite stable, as
expected.


Table 5.1. Dataset eurodip: Rounded posterior expectation of the dipper
population size, N, under a uniform prior U({1, ..., 400})

m2          0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
E[N|m2]   355  349  340  329  316  299  277  252  224  197  172  152  135  122  110  101

Table 5.2. Dataset eurodip: Rounded posterior expectation of the dipper
population size, N, under a uniform prior U({1, ..., S}), for m2 = 11

S         100  150  200  250  300  350  400  450  500
E[N|m2]    95  125  141  148  151  151  152  152  152

Leaving the Darroch model and getting back to the two-stage capture
model with probability $p$ of capture, the posterior distribution of $(N,p)$
associated with the noninformative prior $\pi(N,p) = 1/N$ is proportional to

$$ \frac{(N-1)!}{(N-n^+)!}\, p^{n^c}(1-p)^{2N-n^c}. $$

Thus, if $n^+ > 0$, both conditional posterior distributions are standard
distributions since

$$ p|n^c,N \sim \mathcal{B}e(n^c+1,\, 2N-n^c+1), \qquad N-n^+|n^+,p \sim \mathcal{N}eg(n^+,\, 1-(1-p)^2), $$

the latter being a negative binomial distribution. Indeed, as a function of $N$,

$$ \frac{(N-1)!}{(N-n^+)!}\,(1-p)^{2N-n^c} \propto \binom{N-1}{N-n^+}\, \{(1-p)^2\}^{N-n^+}\, \{1-(1-p)^2\}^{n^+}. $$

Therefore, while the marginal posterior in $N$ is difficult to manage, the joint
distribution of $(N,p)$ can be approximated by a Gibbs sampler, as follows:

Algorithm 5.8 Two-stage Capture-Recapture Gibbs Sampler

Initialization: Generate $p^{(0)} \sim \mathcal{U}([0,1])$.

Iteration $i$ ($i \ge 1$):

1. Generate $N^{(i)} - n^+ \sim \mathcal{N}eg(n^+,\, 1-(1-p^{(i-1)})^2)$.

2. Generate $p^{(i)} \sim \mathcal{B}e(n^c+1,\, 2N^{(i)}-n^c+1)$.
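
The two steps of the algorithm translate almost line by line into R. The function below is a sketch, with `two_stage_gibbs` a hypothetical name; it assumes $n^+ > 0$ and relies on the fact that `rnbinom(1, size, prob)` returns the number of failures before `size` successes, which matches the distribution of $N - n^+$.

```r
# Sketch of the two-stage Gibbs sampler; nplus and nc are the sufficient
# statistics n+ and n^c, and nplus must be positive.
two_stage_gibbs <- function(nsimu, nplus, nc) {
  N <- p <- numeric(nsimu)
  p_cur <- runif(1)                        # initialization: p ~ U([0,1])
  for (i in 1:nsimu) {
    # step 1: N - n+ is negative binomial with success prob 1 - (1 - p)^2
    N[i] <- nplus + rnbinom(1, size = nplus, prob = 1 - (1 - p_cur)^2)
    # step 2: p given N is a beta variate
    p[i] <- rbeta(1, nc + 1, 2 * N[i] - nc + 1)
    p_cur <- p[i]
  }
  list(N = N, p = p)
}

set.seed(1)
out <- two_stage_gibbs(5000, nplus = 71, nc = 82)   # eurodip, first two years
mean(out$N); mean(out$p)
```

The Monte Carlo averages in the last line approximate the posterior expectations under the $1/N$ prior, so they need not coincide with the values obtained earlier under the bounded uniform prior.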


5.2.3 The T-Stage Capture-Recapture Model

A further extension to the two-stage capture-recapture model is to consider
instead a series of $T$ consecutive captures. In that case, if we denote by $n_t$
the number of individuals captured at period $t$ ($1 \le t \le T$) and by $m_t$ the
number of recaptured individuals (with the convention that $m_1 = 0$), under the
same assumptions as in the two-stage model, then $n_1 \sim \mathcal{B}(N,p)$ and,
conditionally on the $j-1$ previous captures and recaptures ($2 \le j \le T$),

$$ m_j \sim \mathcal{B}\Big(\sum_{t=1}^{j-1}(n_t-m_t),\, p\Big) \quad\text{and}\quad n_j-m_j \sim \mathcal{B}\Big(N-\sum_{t=1}^{j-1}(n_t-m_t),\, p\Big). $$

The likelihood $\ell(N,p|n_1,n_2,m_2,\ldots,n_T,m_T)$ is thus

$$
\begin{aligned}
&\binom{N}{n_1}\, p^{n_1}(1-p)^{N-n_1} \prod_{j=2}^{T} \Bigg[ \binom{N-\sum_{t=1}^{j-1}(n_t-m_t)}{n_j-m_j}\, p^{n_j-m_j}\,(1-p)^{N-\sum_{t=1}^{j}(n_t-m_t)} \\
&\qquad\qquad \times \binom{\sum_{t=1}^{j-1}(n_t-m_t)}{m_j}\, p^{m_j}\,(1-p)^{\sum_{t=1}^{j-1}(n_t-m_t)-m_j} \Bigg] \\
&\propto \frac{N!}{(N-n^+)!}\, p^{n^c}(1-p)^{TN-n^c}\, \mathbb{I}_{N\ge n^+}
\end{aligned}
$$

if we denote the sufficient statistics as

$$ n^+ = \sum_{t=1}^{T}(n_t-m_t) \quad\text{and}\quad n^c = \sum_{t=1}^{T} n_t, $$

the total numbers of captured individuals and captures over the $T$ periods,
respectively.

For a noninformative prior such as $\pi(N,p) = 1/N$, the joint posterior
satisfies

$$ \pi(N,p|n^+,n^c) \propto \frac{(N-1)!}{(N-n^+)!}\, p^{n^c}(1-p)^{TN-n^c}\, \mathbb{I}_{N\ge n^+\vee 1}. $$

Therefore, the conditional posterior distribution of $p$ is

$$ p|N,n^+,n^c \sim \mathcal{B}e(n^c+1,\, TN-n^c+1) $$

and the marginal posterior distribution of $N$

$$ \pi(N|n^+,n^c) \propto \frac{(N-1)!}{(N-n^+)!}\, \frac{(TN-n^c)!}{(TN+1)!}\, \mathbb{I}_{N\ge n^+\vee 1} \qquad (5.2) $$

is computable. Note that the normalization coefficient can also be
approximated by summation with an arbitrary precision unless $N$ and $n^+$ are
very large.

For the uniform prior $\mathcal{U}(\{1,\ldots,S\})$ on $N$ and $\mathcal{U}([0,1])$ on
$p$, the posterior distribution of $N$ is then proportional to

$$ \pi(N|n^+) \propto \binom{N}{n^+}\, \frac{(TN-n^c)!}{(TN+1)!}\, \mathbb{I}_{\{n^+\vee 1,\ldots,S\}}(N). $$

For the whole set of observations in eurodip, we have $T = 7$, $n^+ = 294$,
and $n^c = 519$. Under the uniform prior with $S = 400$, the posterior
expectation of $N$ is given by

> sum((1:400)*pcapture(7,294,519))
[1] 372.7384

While this value seems dangerously close to the upper bound of 400 on $N$
and thus leads us to suspect a strong influence of the upper bound $S$, the
computation of the posterior expectation for $S = 2500$

> S=2500;T=7;nplus=294;nc=519
> lprob=lchoose(max(nplus,1):S,nplus)+
+   lgamma(T*max(nplus,1):S-nc+1)-lgamma(T*max(nplus,1):S+2)
> prob=c(rep(0,max(nplus,1)-1),exp(lprob-max(lprob)))
> sum((1:S)*prob)/sum(prob)
[1] 373.9939

leads to 373.99, which shows the limited impact of this hyperparameter $S$.


Using even a slightly more advanced sampling model may lead to genuine
computational difficulties. For instance, consider a heterogeneous
capture-recapture model where the individuals are captured at time
$1 \le t \le T$ with probability $p_t$ and where both the size $N$ of the
population and the probabilities $p_t$ are unknown. The corresponding
likelihood is

$$ \ell(N,p_1,\ldots,p_T|n_1,n_2,m_2,\ldots,n_T,m_T) \propto \frac{N!}{(N-n^+)!} \prod_{t=1}^{T} p_t^{n_t}(1-p_t)^{N-n_t}. $$

If the associated prior on $(N,p_1,\ldots,p_T)$ is such that

$$ N \sim \mathcal{P}(\lambda) $$

and ($1 \le t \le T$)

$$ \alpha_t = \log\Big(\frac{p_t}{1-p_t}\Big) \sim \mathcal{N}(\mu_t,\sigma^2), $$

where both $\sigma^2$ and the $\mu_t$'s are known,⁵ the posterior distribution
satisfies

$$ \pi(\alpha_1,\ldots,\alpha_T,N|n_1,\ldots,n_T) \propto \frac{N!}{(N-n^+)!}\, \frac{\lambda^N}{N!} \prod_{t=1}^{T}(1+e^{\alpha_t})^{-N} \times \exp\Big\{ \sum_{t=1}^{T}\Big[ \alpha_t n_t - \frac{1}{2\sigma^2}(\alpha_t-\mu_t)^2 \Big] \Big\}. \qquad (5.3) $$

It is thus much less manageable from a computational point of view, especially
when there are many capture episodes. A corresponding Gibbs sampler could
simulate easily from the conditional posterior distribution on $N$ since

$$ N-n^+|\alpha,n^+ \sim \mathcal{P}\Big( \lambda \prod_{t=1}^{T}(1+e^{\alpha_t})^{-1} \Big), $$

but the conditionals on the $\alpha_t$'s ($1 \le t \le T$) are less conventional,

$$ \alpha_t|N,n \sim \pi_t(\alpha_t|N,n) \propto (1+e^{\alpha_t})^{-N} \exp\big\{ \alpha_t n_t - (\alpha_t-\mu_t)^2/2\sigma^2 \big\}, $$

and they require either an accept-reject algorithm (Sect. 5.4) or a
Metropolis-Hastings algorithm in order to be simulated.
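
To make the Metropolis-Hastings option concrete, here is one possible random-walk update for a single $\alpha_t$, targeting the conditional density given above. It is a sketch under stated assumptions: the names `log_target` and `mh_alpha`, the proposal scale `tau`, and every numerical value in the usage lines are illustrative choices, not quantities from the text.

```r
# Sketch of a random-walk Metropolis-Hastings update for alpha_t.
log_target <- function(alpha, N, nt, mu, sigma2) {
  # log of (1 + e^alpha)^(-N) * exp(alpha * nt - (alpha - mu)^2 / (2 sigma2));
  # log(1 + e^alpha) is evaluated in an overflow-safe way
  -N * (pmax(alpha, 0) + log1p(exp(-abs(alpha)))) +
    alpha * nt - (alpha - mu)^2 / (2 * sigma2)
}

mh_alpha <- function(alpha, N, nt, mu, sigma2, tau = 0.5) {
  prop <- alpha + tau * rnorm(1)                   # symmetric normal proposal
  log_ratio <- log_target(prop, N, nt, mu, sigma2) -
    log_target(alpha, N, nt, mu, sigma2)
  if (log(runif(1)) < log_ratio) prop else alpha   # accept, or keep current value
}

set.seed(3)
alpha <- numeric(2000)
for (i in 2:2000)
  alpha[i] <- mh_alpha(alpha[i - 1], N = 300, nt = 60, mu = -1, sigma2 = 1)
```

Within a full Gibbs sampler, one such update per $\alpha_t$ would alternate with the Poisson draw of $N - n^+$; the scale `tau` would then need tuning to reach a reasonable acceptance rate.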

For the prior

$$ \pi(N,p) \propto \frac{\lambda^N}{N!}\, \mathbb{I}_{\mathbb{N}}(N)\, \mathbb{I}_{[0,1]}(p), $$

the conditional posteriors are then

$$ p|N,n^c \sim \mathcal{B}e(n^c+1,\, TN-n^c+1) \quad\text{and}\quad N-n^+|p,n^+ \sim \mathcal{P}(\lambda(1-p)^T) $$

and a Gibbs sampler similar to the one developed in Algorithm 5.8 can easily
be implemented, for instance via the code

> lambda=200
> nsimu=10^4
> p=rep(1,nsimu); N=p
> N[1]=2*nplus
> p[1]=rbeta(1,nc+1,T*lambda-nc+1)
> for (i in 2:nsimu){
+   N[i]=nplus+rpois(1,lambda*(1-p[i-1])^T)
+   p[i]=rbeta(1,nc+1,T*N[i]-nc+1)
+ }

⁵ This assumption can be justified on the basis that each capture probability is
only observed once on the $t$th round (and so cannot reasonably be associated
with a noninformative prior).


For eurodip, we used this Gibbs sampler and obtained the results
illustrated by Fig. 5.1. When the chain is initialized at the (unlikely) value
$N = \lambda = 200$ (which is the prior expectation of $N$), the stabilization of
the chain is quite clear: It only takes a few iterations to converge toward the
proper region that supports the posterior distribution. We can thus visually
confirm the convergence of the algorithm and approximate the Bayes estimators
of $N$ and $p$ by the Monte Carlo averages

> mean(N)
[1] 326.9831
> mean(p)
[1] 0.2271828

Fig. 5.1. Dataset eurodip: Representation of the Gibbs sampling output for the
parameters $p$ (first column) and $N$ (second column)

The precision of these estimates can be assessed as in a regular Monte Carlo
experiment, but the variance estimate is biased because of the correlation
between the simulations. A simple way to assess this effect is to call R
function acf() for each component $\theta_i$ of the parameter, as

$$ \nu_i = 1 + 2\sum_{t=1}^{\infty} \mathrm{cor}\big(\theta_i^{(1)}, \theta_i^{(t+1)}\big) $$

evaluates the loss of efficiency due to the correlation. The corresponding
effective sample size, given by $T^{\mathrm{ess}} = T/\nu$, provides the equivalent
size of an iid sample. For instance,

> 1/(1+2*sum(acf(N)$acf[-1]))
[1] 0.599199
> 1/(1+2*sum(acf(p)$acf[-1]))
[1] 0.6063236

shows that the current Gibbs sampler offers an efficiency of 60% compared
with an iid sample from the posterior distribution.
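
The efficiency computation based on acf() can be wrapped in a small function and tried on chains whose behavior is known in advance. This is a sketch only: `efficiency` is a hypothetical helper, the two test series are simulated for illustration, and the infinite sum is truncated at the maximum lag used by acf(), so the result is an approximation of $1/\nu$.

```r
# Sketch: efficiency and effective sample size from empirical autocorrelations.
efficiency <- function(x, lag.max = NULL) {
  # plot = FALSE returns the autocorrelations without drawing them
  rho <- acf(x, lag.max = lag.max, plot = FALSE)$acf[-1]   # drop lag 0
  1 / (1 + 2 * sum(rho))
}

set.seed(7)
iid <- rnorm(10^4)                                       # independent draws
ar1 <- as.numeric(arima.sim(list(ar = 0.9), n = 10^4))   # strongly correlated
eff_iid <- efficiency(iid)
eff_ar1 <- efficiency(ar1)
ess_ar1 <- length(ar1) * eff_ar1                         # effective sample size
```

An independent sample should give an efficiency close to one, while the autoregressive series should give a much smaller value; a larger `lag.max` captures more of the tail of the autocorrelation function at the price of noisier estimates.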


5.3  Open  Populations 


Moving towards more realistic settings, we now consider the case of an open
population model, where the population size does not remain fixed over the
experiment but, on the contrary, there is a probability $q$ for each individual
to leave the population at each time (or, more accurately, between any two
capture episodes). Given that the associated likelihood involves unobserved
indicators (namely, indicators of survival; see Exercise 5.14), we study here
a simpler model where only the individuals captured during the first capture
experiment are marked and subsequent recaptures are registered. For
three successive capture experiments, we thus have

$$ n_1 \sim \mathcal{B}(N,p), \qquad r_1|n_1 \sim \mathcal{B}(n_1,q), \qquad r_2|n_1,r_1 \sim \mathcal{B}(n_1-r_1,q), $$

for the distributions of the first capture population size and of the numbers
of individuals who vanished between the first and second, and the second and
third experiments, respectively, and

$$ c_2|n_1,r_1 \sim \mathcal{B}(n_1-r_1,p), \qquad c_3|n_1,r_1,r_2 \sim \mathcal{B}(n_1-r_1-r_2,p), $$

for the number of recaptured individuals during the second and the third
experiments, respectively. Here, only $n_1$, $c_2$, and $c_3$ are observed. The
numbers of individuals removed at stages 1 and 2, $r_1$ and $r_2$, are not
available and must therefore be simulated, as well as the parameters $N$, $p$,
and $q$.⁶ The likelihood $\ell(N,p,q,r_1,r_2|n_1,c_2,c_3)$ is given by


$$
\begin{aligned}
&\binom{N}{n_1} p^{n_1}(1-p)^{N-n_1} \binom{n_1}{r_1} q^{r_1}(1-q)^{n_1-r_1} \binom{n_1-r_1}{c_2} p^{c_2}(1-p)^{n_1-r_1-c_2} \\
&\qquad \times \binom{n_1-r_1}{r_2} q^{r_2}(1-q)^{n_1-r_1-r_2} \binom{n_1-r_1-r_2}{c_3} p^{c_3}(1-p)^{n_1-r_1-r_2-c_3}
\end{aligned}
$$

and, if we use the prior
$\pi(N,p,q) \propto N^{-1}\, \mathbb{I}_{[0,1]}(p)\, \mathbb{I}_{[0,1]}(q)$, the associated
conditionals are

$$
\begin{aligned}
\pi(p|N,q,\mathcal{D}^*) &\propto p^{n^+}(1-p)^{u^+}, \\
\pi(q|N,p,\mathcal{D}^*) &\propto q^{r_1+r_2}(1-q)^{2n_1-2r_1-r_2}, \\
\pi(N|p,q,\mathcal{D}^*) &\propto \frac{(N-1)!}{(N-n_1)!}\,(1-p)^{N}\, \mathbb{I}_{N\ge n_1}, \\
\pi(r_1|p,q,n_1,c_2,c_3,r_2) &\propto \frac{(n_1-r_1)!\; q^{r_1}(1-q)^{-2r_1}(1-p)^{-2r_1}}{r_1!\,(n_1-r_1-r_2-c_3)!\,(n_1-c_2-r_1)!}, \\
\pi(r_2|p,q,n_1,c_2,c_3,r_1) &\propto \frac{q^{r_2}\,[(1-p)(1-q)]^{-r_2}}{r_2!\,(n_1-r_1-r_2-c_3)!},
\end{aligned}
$$

where $\mathcal{D}^* = (n_1,c_2,c_3,r_1,r_2)$ and

$$
\begin{aligned}
u_1 &= N-n_1, \quad u_2 = n_1-r_1-c_2, \quad u_3 = n_1-r_1-r_2-c_3, \\
n^+ &= n_1+c_2+c_3, \quad u^+ = u_1+u_2+u_3
\end{aligned}
$$

($u$ stands for unobserved, even though these variables can be computed
conditional on the remaining unknowns). Therefore, the full conditionals are

$$
\begin{aligned}
p|N,q,\mathcal{D}^* &\sim \mathcal{B}e(n^+ + 1,\, u^+ + 1), \\
q|N,p,\mathcal{D}^* &\sim \mathcal{B}e(r_1+r_2+1,\, 2n_1-2r_1-r_2+1), \\
N-n_1|p,q,\mathcal{D}^* &\sim \mathcal{N}eg(n_1,p), \\
r_2|p,q,n_1,c_2,c_3,r_1 &\sim \mathcal{B}\Big(n_1-r_1-c_3,\, \frac{q}{q+(1-q)(1-p)}\Big),
\end{aligned}
$$

which are very easily simulated, while $r_1$ has a less conventional
distribution. However, this difficulty is minor since, in our case, $n_1$ is not
extremely large. It is thus possible to compute the probability that $r_1$ is
equal to each of the values in $\{0,1,\ldots,\min(n_1-r_2-c_3,\, n_1-c_2)\}$. This
means that the corresponding Gibbs sampler can be implemented as well.

⁶ From a theoretical point of view, $r_1$ and $r_2$ are missing variables rather
than true parameters. This obviously does not change anything either for
simulation purposes or for Bayesian inference.

gibbscap1=function(nsimu,n1,c2,c3){
  N=p=q=r1=r2=rep(0,nsimu)
  N[1]=round(n1/runif(1))
  # start r1 and r2 inside their support (r1<=n1-c2 and r1+r2<=n1-c3),
  # so that all beta parameters below are positive
  r1[1]=round((n1-max(c2,c3))*runif(1))
  r2[1]=round((n1-r1[1]-c3)*runif(1))
  nplus=n1+c2+c3
  for (i in 2:nsimu){
    # u+ = u1+u2+u3, the number of unobserved non-captures
    uplus=N[i-1]-r1[i-1]-c2+n1-r1[i-1]-r2[i-1]-c3
    p[i]=rbeta(1,nplus+1,uplus+1)
    q[i]=rbeta(1,r1[i-1]+r2[i-1]+1,2*n1-2*r1[i-1]-r2[i-1]+1)
    N[i]=n1+rnbinom(1,n1,p[i])
    # r1 is drawn from its finite support, with log-probabilities pr
    rbar=min(n1-r2[i-1]-c3,n1-c2)
    pq=q[i]/((1-q[i])*(1-p[i]))^2
    pr=lchoose(n1-c2,0:rbar)+(0:rbar)*log(pq)+
      lchoose(n1-(0:rbar),r2[i-1]+c3)
    r1[i]=sample(0:rbar,1,prob=exp(pr-max(pr)))
    r2[i]=rbinom(1,n1-r1[i]-c3,q[i]/(q[i]+(1-q[i])*(1-p[i])))
  }
  list(N=N,p=p,q=q,r1=r1,r2=r2)
}
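
One way to gain confidence in such a sampler is to run it on data simulated from the open population model with known parameters. The generator below is a sketch: `simul_open` is a hypothetical helper and the parameter values in the usage lines are arbitrary.

```r
# Sketch: simulate (n1, c2, c3) and the hidden (r1, r2) from the open
# population model with three capture experiments.
simul_open <- function(N, p, q) {
  n1 <- rbinom(1, N, p)              # marked at the first capture
  r1 <- rbinom(1, n1, q)             # marked individuals lost before stage 2
  c2 <- rbinom(1, n1 - r1, p)        # recaptures at stage 2
  r2 <- rbinom(1, n1 - r1, q)        # further losses before stage 3
  c3 <- rbinom(1, n1 - r1 - r2, p)   # recaptures at stage 3
  list(n1 = n1, c2 = c2, c3 = c3, r1 = r1, r2 = r2)
}

set.seed(11)
sim <- simul_open(N = 60, p = 0.4, q = 0.1)
```

Feeding `sim$n1`, `sim$c2`, and `sim$c3` to the Gibbs sampler and comparing the output with the values used in the simulation gives a rough check, keeping in mind that three observations carry little information about three parameters.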


We stress that R is quite helpful in simulating from unusual distributions
and in particular from those with finite support. For instance, the conditional
distribution of $r_1$ above can be simulated using the following representation
of $P(r_1 = k|p,q,n_1,c_2,c_3,r_2)$
($0 \le k \le \bar{r} = \min(n_1-r_2-c_3,\, n_1-c_2)$),

$$ \binom{n_1-c_2}{k} \Big\{ \frac{q}{(1-q)^2(1-p)^2} \Big\}^{k} \binom{n_1-k}{r_2+c_3}, $$

up to a normalization constant, since the binomial coefficients and the power in
$k$ can be computed for all values of $k$ at once, thanks to the matrix
capabilities of R, through the command lchoose. The above quantity
corresponding to

pr=lchoose(n=n1-c2,k=0:rbar)+(0:rbar)*log(q1)+
  lchoose(n=n1-(0:rbar),k=r2+c3)

is the whole vector of the log-probabilities, with
$q_1 = q/\{(1-q)^2(1-p)^2\}$.

In most computations, it is safer to use logarithmic transforms to reduce the
risk of running into overflow or underflow error messages. For instance, in the
example above, the probability vector can be recovered by

pr=exp(pr-max(pr))/sum(exp(pr-max(pr)))

while a direct computation of exp(pr) may well produce an Inf value that
invalidates the remaining computations.⁷
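
The effect of subtracting the maximum before exponentiating is easy to see on a toy vector of log-weights. In this sketch the values in `lw` are arbitrary and deliberately chosen so that the direct computation overflows.

```r
# Sketch: normalizing log-weights with and without the max-subtraction step.
lw <- c(1000, 1001, 1002)

naive <- exp(lw) / sum(exp(lw))   # exp(1000) is Inf, so Inf/Inf gives NaN
stable <- exp(lw - max(lw))       # largest term becomes exp(0) = 1
stable <- stable / sum(stable)

naive
stable
```

Both versions are mathematically identical, since the common factor `exp(max(lw))` cancels in the ratio; only the second one survives the finite range of floating-point numbers.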

Once the probabilities are transformed as in the previous R code, a call to
the R command

> sample(0:mm,n,replace=TRUE,prob=exp(pr-max(pr)))

is sufficient to provide n simulations of $r_1$ (replace=TRUE is required as
soon as n exceeds one, since sample() otherwise draws without replacement).
The production of a large Gibbs sample is immediate:

> system.time(gibbscap1(10^5,22,11,6))
   user  system elapsed
 12.816   0.000  12.830

Even a large value such as $n_1 = 1612$ used below does not lead to computing
difficulties since we can run 10,000 iterations of the corresponding Gibbs
sampler in a few seconds on a laptop:

> system.time(gibbscap1(10^4,1612,811,236))
   user  system elapsed
 10.245   0.028  10.294

For eurodip, we have $n_1 = 22$, $c_2 = 11$, and $c_3 = 6$. We obtain the Gibbs
output

> gg=gibbscap1(10^5,22,11,6)

summarized in Fig. 5.2. The sequences for all components are rather stable
and their mixing behavior (i.e., the speed of exploration of the support of
the target) is satisfactory, even though we can still detect a trend in the
first three rows. Since $r_1$ and $r_2$ are integers with only a few possible
values, the last two rows show apparently higher jumps than the three other
parameters. The MCMC approximations to the posterior expectations of $N$ and
$p$ are equal to

> mean(gg$N)
[1] 57.52955
> mean(gg$p)
[1] 0.3962891

respectively.

Given the large difference between $n_1$ and $c_2$ and the proximity between
$c_2$ and $c_3$, high values of $q$ are rejected, and the difference can be
attributed with high likelihood to a poor capture rate. One should take into
account the fact that there are only three observations for a model that
involves three true parameters plus two missing variables. Figure 5.3 gives
another insight into the posterior distribution by representing the joint
distribution of the sample of $(r_1,r_2)$'s

> plot(jitter(gg$r1,factor=1),jitter(gg$r2,factor=1),cex=0.5,
+   xlab=expression(r[1]),ylab=expression(r[2]))

using for representation purposes the R function jitter(), which moves each
point by a tiny random amount. There is a clear positive correlation between
$r_1$ and $r_2$, despite the fact that $r_2$ is simulated on an $(n_1-c_3-r_1)$
scale. The mode of the posterior is $(r_1,r_2) = (0,0)$, which means that it is
likely that no dipper died or left the observation area over the 3-year period.

⁷ This recommendation also applies to the computation of likelihoods that tend
to take absolute values that exceed the range of the computer representation of
real numbers, while only the relative values are relevant for Bayesian
computations. Using a transform such as exp(loglike-max(loglike)) thus helps in
reducing the risk of overflows.


5.4 Accept-Reject Algorithms

In Chap. 2, we mentioned standard random number generators used for the most common distributions and presented importance sampling (Algorithm 2.2) as a possible alternative when such generators are not available. While MCMC algorithms always offer a solution when facing nonstandard distributions, there often exists a simpler alternative, which is in fact used in most of the standard random generators and which we now present. It also relates to the independent Metropolis-Hastings algorithm of Sect. 4.2.2.

Given a density $g$ that is defined on an arbitrary space (of any dimension), a fundamental identity is that simulating $X$ distributed from $g(x)$ is completely equivalent to simulating $(X, U)$ uniformly distributed on the set

$$\mathcal{S}_g = \{(x, u) : 0 \le u \le g(x)\}$$

(this is called the Fundamental Theorem of Simulation in Robert and Casella, 2004, Chap. 3). The reason for this equivalence is simply that

$$\int_0^\infty \mathbb{I}_{0 \le u \le g(x)}\,\mathrm{d}u = g(x)\,.$$

Fig. 5.2. Dataset eurodip: Representation of the Gibbs sampling output for the five parameters of the open population model, based on 10,000 iterations, with raw plots (first column) and histograms (second column)

Since $\mathcal{S}_g$ usually has complex features, direct simulation from the uniform distribution on $\mathcal{S}_g$ is most often impossible (Exercise 5.16). The idea behind the accept-reject method is to find a simpler set $\mathcal{S}$ that contains $\mathcal{S}_g$, $\mathcal{S}_g \subset \mathcal{S}$, and then to simulate uniformly on this set $\mathcal{S}$ until the value belongs to $\mathcal{S}_g$. In practice, this means that one needs to find an upper bound on $g$; that is, another density $f$ and a constant $M$ such that

$$g(x) \le M f(x) \qquad (5.5)$$

on the support of the density $g$. (Note that $M \ge 1$ necessarily.) Implementing the following algorithm then leads to a simulation from $g$.


Algorithm  5.9  Accept-Reject  Sampler 

1. Generate $X \sim f$, $U \sim \mathcal{U}_{[0,1]}$.

2. Accept $Y = x$ if $u \le g(x)/(M f(x))$.

3.  Return  to  1  otherwise. 
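
The three steps of Algorithm 5.9 can be sketched in a few lines of R. The example below is an illustration under stated assumptions, not code from this book: the names accept_reject, rf, df and res are hypothetical, the target is the unnormalized kernel $g(x) = x(1-x)$ of a $\mathcal{B}e(2,2)$ distribution, the proposal $f$ is uniform on $[0,1]$, and $M = 1/4$ is the maximum of $g$.

```r
# Generic accept-reject sampler (illustrative sketch of Algorithm 5.9).
# g  : target density, possibly unnormalized
# rf : function(n) simulating n values from the proposal f
# df : density of the proposal f
# M  : constant such that g(x) <= M * df(x) on the support of g
accept_reject <- function(nsim, g, rf, df, M) {
  out <- numeric(nsim)
  ntry <- 0                                  # total number of proposals
  for (i in seq_len(nsim)) {
    repeat {
      x <- rf(1)                             # Step 1: propose from f
      ntry <- ntry + 1
      if (runif(1) <= g(x) / (M * df(x))) break   # Step 2: accept
    }                                        # Step 3: otherwise try again
    out[i] <- x
  }
  list(sample = out, rate = nsim / ntry)
}

set.seed(42)
g <- function(x) x * (1 - x)     # Be(2,2) kernel, normalizing constant omitted
res <- accept_reject(5000, g, rf = runif, df = dunif, M = 0.25)
mean(res$sample)                 # to be compared with the Be(2,2) mean, 1/2
res$rate                         # empirical acceptance rate
```

Because $g$ is unnormalized here, the acceptance rate estimates $\int g / M = (1/6)/(1/4) = 2/3$ rather than $1/M$.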


This  method  provides  a  random  generator  for  densities  g  that  are  known 
up  to  a  multiplicative  factor,  which  is  a  feature  that  occurs  particularly  often 
in  Bayesian  calculations  since  the  posterior  distribution  is  usually  specified  up 
to a normalizing constant.

Fig. 5.3. Dataset eurodip: Representation of the Gibbs sampling output of the $(r_1, r_2)$'s by a jitterplot: to translate the density of the possible values of $(r_1, r_2)$ on the $\mathbb{N}^2$ grid, each simulation has been randomly moved using the R jitter procedure and colored at random using grey levels to help distinguish the various simulations


For the open population model, we found the full conditional distribution of $r_1$ to be rather non-standard, as shown by (5.4). Rather than using an exhaustive enumeration of all probabilities $P(r_1 = k) = g(k)$ and then sampling from this distribution, we can instead try to use a proposal based on a binomial upper bound. Take for instance $f$ that corresponds to the binomial distribution $\mathcal{B}(\bar r, q_2)$ with $\bar r = \min(n_1 - c_2, n_1 - r_2 - c_3)$ and

$$q_2 = q/\{q + (1-q)^2(1-p)^2\}\,.$$

The ratio $g(k)/f(k)$ is proportional to

$$\frac{\binom{n_1-c_2}{k}\binom{n_1-k}{r_2+c_3}}{\binom{\bar r}{k}} \propto \frac{(n_1-k)!}{(\max(n_1-c_2,\, n_1-r_2-c_3)-k)!}\,,$$

which is decreasing in $k$. The ratio is therefore bounded by

$$\frac{\binom{n_1-c_2}{0}\binom{n_1-0}{r_2+c_3}}{\binom{\bar r}{0}} = \frac{n_1!}{(r_2+c_3)!\,(n_1-r_2-c_3)!}$$

(up to the same normalizing constant). Note that this is not the constant $M$ introduced in Algorithm 5.9 because we use unnormalized densities (the bound $M$ may therefore also depend on $q_2$). Therefore we cannot derive the average acceptance rate from this ratio and we have to use a Monte Carlo experiment to check whether or not the method is really efficient (see Exercise 5.20).
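
The monotonicity claim and the bound can be checked numerically for the eurodip values. The sketch below restates the function thresh given in this section and evaluates the ratio over all admissible values of $k$; the names ratio, decreasing, bound and attained are illustrative.

```r
# Ratio g(k)/f(k) up to a constant, as in the function thresh of this section
thresh <- function(k, n1, c2, c3, r2, barr)
  choose(n1 - c2, k) * choose(n1 - k, c3 + r2) / choose(barr, k)

n1 <- 22; c2 <- 11; c3 <- 6; r2 <- 1     # eurodip values used in the text
barr <- min(n1 - c2, n1 - r2 - c3)       # upper limit used in ardipper
k <- 0:barr
ratio <- thresh(k, n1, c2, c3, r2, barr)

decreasing <- all(diff(ratio) < 0)       # is the ratio decreasing in k?
bound <- choose(n1, r2 + c3)             # n1! / {(r2+c3)! (n1-r2-c3)!}
attained <- isTRUE(all.equal(max(ratio), bound)) && which.max(ratio) == 1
c(decreasing = decreasing, attained = attained)
```

Both flags being TRUE confirms, for these values only, that the maximum is reached at $k = 0$.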

If we use the values from eurodip (that is, $n_1 = 22$, $c_2 = 11$ and $c_3 = 6$, with $r_2 = 1$ and $q_2 = 0.1$), we can use R functions like

thresh=function(k,n1,c2,c3,r2,barr){
  # ratio g(k)/f(k), up to a multiplicative constant
  choose(n1-c2,k)*choose(n1-k,c3+r2)/choose(barr,k)
}

ardipper=function(nsimu=1,n1,c2,c3,r2,q2){
  barr=min(n1-c2,n1-r2-c3)
  # bound on the ratio, reached at k=0
  boundM=thresh(0,n1,c2,c3,r2,barr)
  echan=1:nsimu
  for (i in 1:nsimu){
    test=TRUE
    while (test){
      y=rbinom(1,size=barr,prob=q2)
      # reject when the uniform exceeds the ratio divided by its bound
      test=(runif(1)>thresh(y,n1,c2,c3,r2,barr)/boundM)
    }
    echan[i]=y
  }
  echan
}

With these functions, the average of the acceptance ratios $g(k)/Mf(k)$ is equal to 0.12. This is a relatively small value since it corresponds to a rejection rate of about 9/10. The simulation process could thus be a little slow, although

> system.time(ardipper(10^5,n1=22,c2=11,c3=6,r2=1,q2=.1))
   user  system elapsed
  8.148   0.024  8.1959

shows  this  is  not  the  case.  (Note  that  the  code  ardipper  provided  here  does 
not  produce  the  rejection  rate.  It  has  to  be  modified  for  this  purpose.)  An 
histogram  of  accepted  values  is  shown  in  Fig.  5.4. 

Obviously, this method is not hassle-free. For complex densities $g$, it may prove impossible to find a density $f$ such that $g(x) \le Mf(x)$ and $M$ is small enough. However, there exists a large class of univariate distributions for which a generic choice of $f$ is possible (see Robert and Casella, 2004, Chap. 2).

Fig. 5.4. Dataset eurodip: Sample from the distribution (5.4) obtained by accept-reject and based on the simulation of 10,000 values from a $\mathcal{B}(\bar r, q_2)$ distribution for $n_1 = 22$, $c_2 = 11$, $c_3 = 6$, $r_2 = 1$, and $q_2 = 0.1$


5.5 The Arnason-Schwarz Capture-Recapture Model

We consider in this final section a more advanced capture-recapture model based on the realistic assumption that, in most capture-recapture experiments, we can tag individuals one by one; that is, we can distinguish each individual at the time of its first capture and thus follow its capture history. For instance, when tagging mammals and birds, differentiated tags can be used, so that there is only one individual with tag, say, 23131932.

The Arnason-Schwarz model thus considers a capture-recapture experiment as a collection of individual histories. For each individual that has been captured at least once during the experiment, individual characteristics of interest are registered at each capture. For instance, this may include location, weight, sexual status, pregnancy occurrence, social status, and so on. The probabilistic modeling includes this categorical decomposition by adding what we will call movement probabilities to the survival probabilities already used in the Darroch open population model of Sect. 5.2.2. From a theoretical point of view, this is a first example of a (partially) hidden Markov model, a structure studied in detail in Chap. 7. In addition, the model includes the possibility that individuals vanish from the population between two capture experiments. (This is thus another example of an open population model.)

8 In a capture-recapture experiment used in Dupuis (1995), a population of lizards was observed in the south of France (Lozère). When it was found that plastic tags caused necrosis on those lizards, the biologists in charge of the experiment decided to cut a phalange of one of the fingers of the captured lizards to identify them later. While the number of possibilities, $2^{20}$, is limited, it is still much larger than the number of captured lizards in this study. Whether or not the lizards appreciated this ability to classify them is not known.

As in eurodip, the interest that drives the capture-recapture experiment may be to study the movements of individuals within a zone $\mathcal{K}$ divided into $k = 3$ strata denoted by 1, 2, 3. (This structure is generic: Zones are not necessarily geographic and can correspond to anything from social status, to HIV stage, to university degree.) For instance, four consecutive rows of possible eurodip (individual) capture-recapture histories look as follows:

45   0 3 0 0 0 0 0
46   0 2 2 2 2 1 1
47   0 2 0 0 0 0 0
48   2 1 2 1 0 0 0


where  0  denotes  a  failure  to  capture.  This  means  that,  for  dipper  number  46, 
the  first  location  was  not  observed  but  this  dipper  was  captured  for  all  the 
other  experiments.  For  dippers  number  45  and  47,  there  was  no  capture  after 
the  second  time  and  thus  one  or  both  of  them  could  be  dead  (or  outside  the 
range  of  the  capture  area)  at  the  time  of  the  last  capture  experiment.  We 
also stress that the Arnason-Schwarz model often assumes that individuals that were not part of the population on the first capture experiments can be identified as such. We thus have cohorts of individuals that entered the study in the first year, the second year, and so on.
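
To make the reading of such histories concrete, the following sketch stores the four rows above in a matrix and extracts, for each dipper, the first and last capture occasions and the number of captures. The object names hist4, captured, first and last are illustrative and not taken from this book.

```r
# The four eurodip-style histories shown above (0 = not captured)
hist4 <- rbind("45" = c(0, 3, 0, 0, 0, 0, 0),
               "46" = c(0, 2, 2, 2, 2, 1, 1),
               "47" = c(0, 2, 0, 0, 0, 0, 0),
               "48" = c(2, 1, 2, 1, 0, 0, 0))

captured <- hist4 != 0                                   # capture indicators
first <- apply(captured, 1, function(h) min(which(h)))   # entry occasion (cohort)
last  <- apply(captured, 1, function(h) max(which(h)))   # last sighting
cbind(first, last, ncapt = rowSums(captured))
```

The column first gives the cohort of each individual, while any occasion after last is one where the animal may be dead or simply missed.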


5.5.1  Modeling 

A description of the basic Arnason-Schwarz model involves two types of variables for each individual $i$ ($i = 1, \ldots, n$) in the population: first, a variable that describes the location of this individual,

$$z_i = (z_{(i,t)},\ t = 1, \ldots, \tau)\,,$$

where $\tau$ is the number of capture periods; and, second, a binary variable that describes the capture history of this individual,

$$x_i = (x_{(i,t)},\ t = 1, \ldots, \tau)\,.$$

9 This is the case, for instance, with newborns or new mothers in animal capture experiments.

We assume that $z_{(i,t)} = r$ means the animal $i$ is alive in stratum $r$ at time $t$ and that $z_{(i,t)} = \dagger$ denotes the case when the animal $i$ is dead at time $t$. The variable $z_i$ is sometimes called the migration process of individual $i$ by analogy with the special case where one is considering animals moving between geographical zones, like some northern birds in spring and fall. Note that $x_i$ is entirely observed, while $z_i$ is not. For instance, we may have

$$x_i = 1\ 1\ 0\ 1\ 1\ 1\ 0\ 0\ 0$$

and

$$z_i = 1\ 2\ \cdot\ 3\ 1\ 1\ \cdot\ \cdot\ \cdot\,,$$

for which a possible completed $z_i$ is

$$z_i = 1\ 2\ 1\ 3\ 1\ 1\ 2\ \dagger\ \dagger\,,$$

meaning that the animal died between the seventh and the eighth capture events. In particular, the Arnason-Schwarz model assumes that dead animals are never observed (although this type of assumption can easily be modified when processing the model, in what are called tag-recovery experiments). Therefore $z_{(i,t)} = \dagger$ always corresponds to $x_{(i,t)} = 0$.

Moreover, we assume that the $(x_i, z_i)$'s ($i = 1, \ldots, n$) are independent and that each random vector $z_i$ is a Markov chain taking values in $\mathcal{K} \cup \{\dagger\}$ with uniform initial probability on $\mathcal{K}$ (unless there is prior information to the contrary). The parameters of the Arnason-Schwarz model are thus of two kinds: the capture probabilities

$$p_t(r) = P(x_{(i,t)} = 1 \mid z_{(i,t)} = r)$$

on the one hand and the transition probabilities

$$q_t(r, s) = P(z_{(i,t+1)} = s \mid z_{(i,t)} = r)\,, \quad r \in \mathcal{K},\ s \in \mathcal{K} \cup \{\dagger\}\,, \quad q_t(\dagger, \dagger) = 1\,,$$

on the other hand. We derive two further sets of parameters, $\varphi_t(r) = 1 - q_t(r, \dagger)$, the survival probabilities, and $\psi_t(r, s)$, the interstrata movement probabilities, defined as

$$q_t(r, s) = \varphi_t(r) \times \psi_t(r, s)\,.$$
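
The decomposition of the transition probabilities can be illustrated by assembling the full transition matrix on $\mathcal{K} \cup \{\dagger\}$ from survival and movement probabilities. The numerical values of phi and psi below are hypothetical, chosen only for illustration with time-constant parameters and $k = 3$ strata.

```r
# Hypothetical time-constant values for k = 3 strata (illustration only)
phi <- c(0.9, 0.8, 0.7)                  # survival probabilities phi(r)
psi <- rbind(c(0.6, 0.3, 0.1),           # movement probabilities psi(r,s),
             c(0.2, 0.5, 0.3),           # each row sums to one
             c(0.1, 0.1, 0.8))

k <- length(phi)
q <- phi * psi                           # q(r,s) = phi(r) * psi(r,s), row by row
q <- cbind(q, 1 - phi)                   # q(r,dagger) = 1 - phi(r)
q <- rbind(q, c(rep(0, k), 1))           # the dagger state is absorbing
dimnames(q) <- list(c(1:k, "dead"), c(1:k, "dead"))
q
rowSums(q)                               # every row sums to one
```

The product phi * psi relies on R recycling the vector phi down the columns, so that row $r$ of psi is multiplied by phi[r].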

10 Covariates registered once or at each time will not be used here, although they could be introduced via a generalized linear model as in Chap. 4, so we abstain from adding further notations in an already dense section.

The likelihood corresponding to the complete observation of the $(x_i, z_i)$'s, $\ell(p_1, \ldots, p_\tau, q_1, \ldots, q_\tau \mid (x_1, z_1), \ldots, (x_n, z_n))$, is then given by

$$\prod_{i=1}^{n} \left[ \prod_{t=1}^{\tau} p_t(z_{(i,t)})^{x_{(i,t)}} \{1 - p_t(z_{(i,t)})\}^{1 - x_{(i,t)}} \times \prod_{t=1}^{\tau-1} q_t(z_{(i,t)}, z_{(i,t+1)}) \right] \qquad (5.6)$$

up to a constant. The complexity of the likelihood corresponding to the data actually observed is due to the fact that the $z_i$'s are not fully observed, hence that (5.6) would have to be summed over all possible values of the missing components of the $z_i$'s. This complexity can be bypassed by a simulation alternative described below in Sect. 5.5.2.
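
When the locations are complete, (5.6) is a plain product and its logarithm is cheap to evaluate. The function below is a minimal sketch under simplifying assumptions that are not part of the text: parameters are constant over time, states are coded $1, \ldots, m$ with $m+1$ standing for $\dagger$, and the names complete_loglik, pz, cap and trans are hypothetical.

```r
# Complete-data log-likelihood (5.6) for time-constant parameters (sketch).
# z : n x T matrix of states coded 1..m, with m+1 standing for the dead state
# x : n x T matrix of capture indicators (0/1)
# p : vector of the m capture probabilities
# q : (m+1) x (m+1) transition matrix, last row/column = dead state
complete_loglik <- function(z, x, p, q) {
  nT <- ncol(z)
  pz <- c(p, 0)[as.vector(z)]              # capture prob. at each (i,t); 0 if dead
  cap <- sum(dbinom(as.vector(x), 1, pz, log = TRUE))
  from <- as.vector(z[, -nT, drop = FALSE])
  to   <- as.vector(z[, -1, drop = FALSE])
  trans <- sum(log(q[cbind(from, to)]))    # one term per observed transition
  cap + trans
}

q <- rbind(c(0.6, 0.3, 0.1),
           c(0.2, 0.5, 0.3),
           c(0.0, 0.0, 1.0))
p <- c(0.5, 0.4)
z <- rbind(c(1, 2, 3))                     # alive in 1, then in 2, then dead
x <- rbind(c(1, 0, 0))                     # captured at the first occasion only
ll <- complete_loglik(z, x, p, q)
all.equal(ll, log(0.5 * 0.6 * 0.3 * 0.3))
```

For this single history the four non-trivial factors are $p(1)$, $1 - p(2)$, $q(1,2)$ and $q(2,\dagger)$, which is what the final comparison checks.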

The prior modeling corresponding to these parameters will depend on the information that is available about the population covered by the capture-recapture experiment. For illustration's sake, consider the use of conjugate priors

$$p_t(r) \sim \mathcal{B}e(a_t(r), b_t(r))\,, \qquad \varphi_t(r) \sim \mathcal{B}e(\alpha_t(r), \beta_t(r))\,,$$

where the hyperparameters, $a_t(r)$, $b_t(r)$ and so on, depend on both time $t$ and location $r$, and

$$\psi_t(r) \sim \mathcal{D}ir(\gamma_t(r))\,,$$

a Dirichlet distribution, where $\psi_t(r) = (\psi_t(r, s);\ s \in \mathcal{K})$ with

$$\sum_{s \in \mathcal{K}} \psi_t(r, s) = 1\,,$$

and $\gamma_t(r) = (\gamma_t(r, s);\ s \in \mathcal{K})$. The determination of these (numerous) hyperparameters is also case-dependent and varies from a noninformative modeling, where all hyperparameters are taken to be equal to 1 or 1/2, to a very informative setting where exact values of these hyperparameters can be chosen from the prior information. The following example is an illustration of the latter.


Table 5.3. Prior information about the capture and survival parameters of the Arnason-Schwarz model, represented by prior expectation and prior confidence interval, for a capture-recapture experiment on the migrations of lizards (source: Dupuis, 1995)

Episode                  2            3            4            5            6
$p_t$   Mean            0.3          0.4          0.5          0.2          0.2
        95% cred. int.  [0.1,0.5]    [0.2,0.6]    [0.3,0.7]    [0.05,0.4]   [0.05,0.4]

Site                            A                          B,C
Episode                  t=1,3,5      t=2,4        t=1,3,5      t=2,4
$\varphi_t(r)$  Mean     0.7          0.65         0.7          0.7
        95% cred. int.  [0.4,0.95]   [0.35,0.9]   [0.4,0.95]   [0.4,0.95]

Example 5.1. For the capture-recapture experiment described in Footnote 8 on the migrations of lizards between three adjacent zones, there are six capture episodes. The prior information provided by the biologists on the capture and survival probabilities, $p_t$ (which are assumed to be zone independent) and $\varphi_t(r)$, is given by Table 5.3. While this may seem very artificial, this construction of the prior distribution actually happened that way because the biologists in charge were able to quantify their beliefs and intuitions in terms of prior expectation and prior confidence interval. (The differences in the prior values on $p_t$ are due to differences in capture efforts, while the differences between the group of episodes 1, 3 and 5, and the group of episodes 2 and 4 are due to the fact that the odd indices correspond to spring and the even indices to fall and mortality is higher over the winter.) Moreover, this prior information can be perfectly translated into a collection of beta priors by the R divide-and-conquer function

probet=function(a,b,c,alpha){
  # mass of [a,b] under a beta distribution with mean c
  coc=(1-c)/c
  pbeta(b,alpha,alpha*coc)-pbeta(a,alpha,alpha*coc)
}

solbeta=function(a,b,c,prec=10^(-3)){
  coc=(1-c)/c
  detail=alpha=1
  # coarse search: increase alpha until [a,b] has mass at least .95
  while (probet(a,b,c,alpha)<.95) alpha=alpha+detail
  # refinement: step back and restart with a step ten times smaller
  while (abs(probet(a,b,c,alpha)-.95)>prec){
    alpha=max(alpha-detail,detail/10)
    detail=detail/10
    while (probet(a,b,c,alpha)<.95) alpha=alpha+detail
  }
  list(alpha=alpha,beta=alpha*coc)
}

(see Exercise 5.23 for details). Repeated calls to solbeta as in

> solbeta(.1,.5,.3,10^(-4))
[1]  5.45300 12.72367

then lead to the hyperparameters given in Table 5.4. ◄
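
As a cross-check of this calibration, the same pair of hyperparameters can be searched with a generic root finder. The sketch below swaps in uniroot for the divide-and-conquer loop of solbeta; it is not the book's method. The names coverage and fit_beta and the search bracket [1, 50] are assumptions chosen for this particular example.

```r
# Coverage of [a,b] under a Be(alpha, alpha*(1-c)/c) prior with mean c
coverage <- function(alpha, a, b, c)
  pbeta(b, alpha, alpha * (1 - c) / c) - pbeta(a, alpha, alpha * (1 - c) / c)

# Root search replacing the divide-and-conquer loop (swapped-in technique);
# the bracket [1, 50] is an assumption suited to this example
fit_beta <- function(a, b, c, level = 0.95, bracket = c(1, 50)) {
  alpha <- uniroot(function(al) coverage(al, a, b, c) - level,
                   interval = bracket, tol = 1e-10)$root
  c(alpha = alpha, beta = alpha * (1 - c) / c)
}

hyp <- fit_beta(0.1, 0.5, 0.3)
hyp                                        # to be compared with the solbeta output
cov95 <- coverage(unname(hyp["alpha"]), 0.1, 0.5, 0.3)
cov95                                      # mass of [0.1, 0.5] under the fitted prior
```

The prior mean is fixed at $c$ by construction, so only the concentration alpha needs to be tuned to reach the requested coverage.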


Table 5.4. Hyperparameters of the beta priors corresponding to the information contained in Table 5.3 (source: Dupuis, 1995)

Episode   2                     3                     4                      5                       6
Dist.     $\mathcal{B}e(5,13)$  $\mathcal{B}e(8,12)$  $\mathcal{B}e(12,12)$  $\mathcal{B}e(3.5,14)$  $\mathcal{B}e(3.5,14)$

Site                  A                                                   B
Episode   t=1,3,5                  t=2,4                    t=1,3,5                  t=2,4
Dist.     $\mathcal{B}e(6.0,2.5)$  $\mathcal{B}e(6.5,3.5)$  $\mathcal{B}e(6.0,2.5)$  $\mathcal{B}e(6.0,2.5)$


5.5.2 Gibbs Sampler

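Given the presence of missing data in the Arnason-Schwarz model, a Gibbs sampler is a natural solution to handle the complexity of the likelihood. It needs to include simulation of the missing components in the vectors $z_i$ in order to simulate the parameters from the full conditional distribution

$$\pi(\theta \mid x, z) \propto \ell(\theta \mid x, z) \times \pi(\theta)\,,$$

where $x$ and $z$ denote the collections of the vectors of capture indicators and locations, respectively. This is thus a particular case of data augmentation, where the missing data $z$ are simulated at each step $t$ in order to reconstitute a complete sample $(x, z^{(t)})$ for which conjugacy applies. In the setting of the Arnason-Schwarz model, we can simulate the full conditional distributions both of the parameters and of the missing components. The Gibbs sampler is as follows:


Algorithm 5.10 Arnason-Schwarz Gibbs Sampler

Iteration $l$ ($l \ge 1$):

1. Parameter simulation

Simulate $\theta^{(l)} \sim \pi(\theta \mid z^{(l-1)}, x)$ as ($t = 1, \ldots, \tau$),

$$p_t^{(l)}(r) \mid x, z^{(l-1)} \sim \mathcal{B}e\left(a_t(r) + u_t^{(l)}(r),\ b_t(r) + v_t^{(l)}(r)\right),$$

$$\varphi_t^{(l)}(r) \mid x, z^{(l-1)} \sim \mathcal{B}e\Big(\alpha_t(r) + \sum_{j \in \mathcal{K}} w_t^{(l)}(r, j),\ \beta_t(r) + w_t^{(l)}(r, \dagger)\Big),$$

$$\psi_t^{(l)}(r) \mid x, z^{(l-1)} \sim \mathcal{D}ir\left(\gamma_t(r, s) + w_t^{(l)}(r, s);\ s \in \mathcal{K}\right),$$

where

$$w_t^{(l)}(r, s) = \sum_{i=1}^{n} \mathbb{I}\big(z^{(l-1)}_{(i,t)} = r,\ z^{(l-1)}_{(i,t+1)} = s\big)\,,$$

$$u_t^{(l)}(r) = \sum_{i=1}^{n} \mathbb{I}\big(x_{(i,t)} = 1,\ z^{(l-1)}_{(i,t)} = r\big)\,,$$

$$v_t^{(l)}(r) = \sum_{i=1}^{n} \mathbb{I}\big(x_{(i,t)} = 0,\ z^{(l-1)}_{(i,t)} = r\big)\,.$$

2. Missing location simulation

Generate the unobserved $z^{(l)}_{(i,t)}$'s from the full conditional distributions

$$P\big(z^{(l)}_{(i,1)} = s \mid x_{(i,1)}, z^{(l-1)}_{(i,2)}, \theta^{(l)}\big) \propto q_1^{(l)}\big(s, z^{(l-1)}_{(i,2)}\big)\,\big(1 - p_1^{(l)}(s)\big)\,,$$

$$P\big(z^{(l)}_{(i,t)} = s \mid x_{(i,t)}, z^{(l)}_{(i,t-1)}, z^{(l-1)}_{(i,t+1)}, \theta^{(l)}\big) \propto q_{t-1}^{(l)}\big(z^{(l)}_{(i,t-1)}, s\big) \times q_t^{(l)}\big(s, z^{(l-1)}_{(i,t+1)}\big)\,\big(1 - p_t^{(l)}(s)\big)\,,$$

$$P\big(z^{(l)}_{(i,\tau)} = s \mid x_{(i,\tau)}, z^{(l)}_{(i,\tau-1)}, \theta^{(l)}\big) \propto q_{\tau-1}^{(l)}\big(z^{(l)}_{(i,\tau-1)}, s\big)\,\big(1 - p_\tau^{(l)}(s)\big)\,.$$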

Note that simulating the missing locations in the $z_i$'s conditionally on the other locations and on the parameters is not a very complex task because of the good conditioning properties of these vectors (which stem from their Markovian nature). As shown in Step 2 of Algorithm 5.10, the full conditional distribution of $z_{(i,t)}$ only depends on the previous and next locations $z_{(i,t-1)}$ and $z_{(i,t+1)}$ (and obviously on the fact that it is not observed; that is, that $x_{(i,t)} = 0$). The corresponding part of the R code is based on a latent matrix containing the current values of both the observed and missing locations:

for (i in 1:n){
  if (z[i,1]==0) latent[i,1]=sample(1:(m+1),1,
    prob=q[,latent[i,2]]*(1-c(p[s,],0)))
  for (t in ((2:(T-1))[z[i,-c(1,T)]==0]))
    latent[i,t]=sample(1:(m+1),1,
      prob=q[latent[i,t-1],]*q[,latent[i,t+1]]*(1-c(p[s,],0)))
  if (z[i,T]==0) latent[i,T]=sample(1:(m+1),1,
    prob=q[latent[i,T-1],]*(1-c(p[s,],0)))
}

(The convoluted range for the inner loop replaces an if (z[i,t]==0).) When the number of states $s \in \mathcal{K}$ is moderate, it is straightforward to simulate from such a distribution.
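
To see what one such draw involves, the sketch below isolates the full conditional of a single missing interior location, given its two neighbors. It assumes time-constant parameters and hypothetical values of q and p for $m = 2$ strata, with state $m+1$ standing for $\dagger$; the function name cond_prob is illustrative.

```r
# Full conditional probabilities of a missing interior location z(i,t),
# given the previous and next states (sketch with time-constant parameters).
# States are coded 1..m, with m+1 for the dead state, as in the code above.
cond_prob <- function(prev, nxt, q, p) {
  w <- q[prev, ] * q[, nxt] * (1 - c(p, 0))   # dead animals are never captured
  w / sum(w)
}

# Hypothetical parameter values for m = 2 strata
q <- rbind(c(0.6, 0.3, 0.1),
           c(0.2, 0.5, 0.3),
           c(0.0, 0.0, 1.0))
p <- c(0.5, 0.4)

pr <- cond_prob(prev = 1, nxt = 2, q, p)
pr                                  # the dead state gets probability zero here
set.seed(1)
draw <- sample(1:3, 1, prob = pr)   # one draw of the missing location
draw
```

Since the animal is seen alive in stratum 2 at the next occasion, the weight of the dead state vanishes through the factor q[, nxt], without any special handling.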

Take $\mathcal{K} = \{1, 2\}$, $n = 4$, $m = 8$ and assume that, for $x$, we have the following histories:

1   1 1 · · 1 · · ·
2   1 · 1 · 1 · 2 1
3   2 1 · 1 2 · · 1
4   1 · · 1 2 1 1 2

Assume also that all (prior) hyperparameters are taken equal to 1. Then one possible instance of a simulated $z$ is

1 1 1 2 1 1 2 †
1 1 1 2 1 1 1 2
2 1 2 1 2 1 1 1
1 2 1 1 2 1 1 2

and it leads to the following simulation of the parameters:

$$p_4^{(l)}(1) \mid x, z^{(l-1)} \sim \mathcal{B}e(1 + 2, 1 + 0)\,,$$

$$\varphi_7^{(l)}(2) \mid x, z^{(l-1)} \sim \mathcal{B}e(1 + 0, 1 + 1)\,,$$

$$\psi_2^{(l)}(1, 2) \mid x, z^{(l-1)} \sim \mathcal{B}e(1 + 1, 1 + 2)\,,$$

in the Gibbs sampler, where the hyperparameters are therefore derived from the (partly) simulated history above. Note that because there are only two possible states, the Dirichlet distribution simplifies into a beta distribution.
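
The counts behind these three beta distributions can be recomputed from the two arrays above. The sketch below copies the arrays as printed (with 0 for a missing capture and 3 as an arbitrary code for $\dagger$) and assumes that the three distributions refer to $p$ at time 4, $\varphi$ at time 7 and $\psi$ at time 2, an assignment inferred from the counts since it is the one that reproduces them; the helper names u, v and w mirror the notation of Algorithm 5.10.

```r
dead <- 3                                  # arbitrary code for the dagger state
# Observed histories x as printed above (0 = not captured)
x <- rbind(c(1, 1, 0, 0, 1, 0, 0, 0),
           c(1, 0, 1, 0, 1, 0, 2, 1),
           c(2, 1, 0, 1, 2, 0, 0, 1),
           c(1, 0, 0, 1, 2, 1, 1, 2))
# Simulated complete locations z as printed above
z <- rbind(c(1, 1, 1, 2, 1, 1, 2, dead),
           c(1, 1, 1, 2, 1, 1, 1, 2),
           c(2, 1, 2, 1, 2, 1, 1, 1),
           c(1, 2, 1, 1, 2, 1, 1, 2))

u <- function(t, r) sum(x[, t] != 0 & z[, t] == r)     # captured in r at time t
v <- function(t, r) sum(x[, t] == 0 & z[, t] == r)     # missed in r at time t
w <- function(t, r, s) sum(z[, t] == r & z[, t + 1] == s)

par_p   <- c(1 + u(4, 1), 1 + v(4, 1))                       # for p_4(1)
par_phi <- c(1 + w(7, 2, 1) + w(7, 2, 2), 1 + w(7, 2, dead)) # for phi_7(2)
par_psi <- c(1 + w(2, 1, 2), 1 + w(2, 1, 1))                 # for psi_2(1,2)
rbind(par_p, par_phi, par_psi)
```

Each row of the result gives the two parameters of the corresponding beta distribution, with the leading 1 in every entry coming from the unit prior hyperparameters.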


Fig. 5.5. Dataset eurodip: Representation of the Gibbs sampling output for some parameters of the Arnason-Schwarz model, based on 10,000 iterations, with raw plots (first column) and histograms (second column)

For eurodip, Lebreton et al. (1992) argue that the capture and survival rates should be constant over time. If we assume that the movement probabilities are also time independent, we are left with $3 + 3 + 3 \times 2 = 12$ parameters. Figure 5.5 gives the Gibbs output for the parameters $p(1)$, $\varphi(2)$, and $\psi(3, 3)$ using noninformative priors with $a(r) = b(r) = \alpha(r) = \beta(r) = \gamma(r, s) = 1$. The simulation of the parameters is obtained by the following piece of R code, where s is the current index of the Gibbs iteration:

for (r1 in 1:m){
  for (r2 in 1:(m+1))
    omega[r2]=sum(latent[,1:(T-1)]==r1 & latent[,2:T]==r2)
  u=sum(z!=0 & latent==r1)
  v=sum(z==0 & latent==r1)
  p[s,r1]=rbeta(1,1+u,1+v)
  phi[s,r1]=rbeta(1,1+sum(omega[1:m]),1+omega[m+1])
  psi[r1,,s]=rdirichlet(1,rep(1,m)+omega[1:m])
}

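The transition probabilities $q_t(r, s)$ are then reconstructed from the survival and movement probabilities, with the special case of the m+1 column corresponding to the absorbing $\dagger$ state:

# tt[r,j]=phi(r): row r carries the survival probability of stratum r,
# so that q(r,j)=phi(r)*psi(r,j) (a byrow=T fill would give phi(j) instead)
tt=matrix(rep(phi[s,],m),m,byrow=FALSE)
q=rbind(tt*psi[,,s],rep(0,m))
q=cbind(q,1-apply(q,1,sum))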

The convergence of the Gibbs sampler to the region of interest occurs very quickly, even though we can spot an approximate periodicity in the raw plots on the left-hand side. The MCMC approximations of the estimates of $p(1)$, $\varphi(2)$, and $\psi(3, 3)$, the empirical mean over the last 8,000 simulations, are equal to 0.25, 0.99, and 0.61, respectively.


5.6  Exercises 

5.1 Show that the posterior distribution $\pi(N \mid n^+)$ given by (5.1), while associated with an improper prior, is defined for all values of $n^+$. Show that the normalization factor of (5.1) is $n^+ \vee 1$, and deduce that the posterior median is equal to $2(n^+ \vee 1) - 1$. Discuss the relevance of this estimator and show that it corresponds to a Bayes estimate of $p$ equal to 1/2.

5.2 Under the same prior as in Sect. 5.2.1, derive the marginal posterior density of $N$ in the case where $n_1 \sim \mathcal{B}(N, p)$ and the subsequent counts $n_2, \ldots, n_{11}$ are observed (the latter are in fact recaptures). Apply to the sample

$$(n_1, n_2, \ldots, n_{11}) = (32, 20, 8, 5, 1, 2, 0, 2, 1, 1, 0)\,,$$

which describes a series of tag recoveries over 11 years.


5.3 Show that the conditional distribution of $m_2$ conditional on both sample sizes $n_1$ and $n_2$ is given by (5.2) and does not depend on $p$. Deduce the expectation $\mathbb{E}^\pi[m_2 \mid n_1, n_2, N]$.


5.4 In order to determine the number $N$ of buses in a town, a capture-recapture strategy goes as follows. We observe $n_1 = 20$ buses during the first day and keep track of their identifying numbers. Then we repeat the experiment the following day by recording the number of buses that have already been spotted on the previous day, say $m_2 = 5$, out of the $n_2 = 30$ buses observed the second day. For the Darroch model, give the posterior expectation of $N$ under the prior $\pi(N) = 1/N$.

5.5 Show that the maximum likelihood estimator of $N$ for the Darroch model is $\hat{N} = n_1 / (m_2/n_2)$, and deduce that it is not defined when $m_2 = 0$.


5.6 Give the likelihood of the extension of Darroch's model when the capture-recapture experiments are repeated $K$ times with capture sizes and recapture observations $n_k$ ($1 \le k \le K$) and $m_k$ ($2 \le k \le K$), respectively. (Hint: Exhibit first the two-dimensional sufficient statistic associated with this model.)

5.7 Give both conditional posterior distributions involved in Algorithm 5.8 in the case $n^+ = 0$.


5.8 Show that, for the two-stage capture model with probability $p$ of capture, when the prior on $N$ is a $\mathcal{P}(\lambda)$ distribution, the conditional posterior on $N - n^+$ is $\mathcal{P}(\lambda(1 - p)^2)$.

5.9 Reproduce the analysis of eurodip summarized by Fig. 5.1 when switching the prior from $\pi(N, p) \propto \lambda^N / N!$ to $\pi(N, p) \propto N^{-1}$.

5.10 An extension of the T-stage capture-recapture model of Sect. 5.2.3 is to consider that the capture of an individual modifies its probability of being captured from $p$ to $q$ for future recaptures. Give the likelihood $\ell(N, p, q \mid n_1, n_2, m_2, \ldots, n_T, m_T)$.


5.11 Another extension of the two-stage capture-recapture model is to allow for mark loss. If we introduce $q$ as the probability of losing the mark, $r$ as the probability of recovering a lost mark and $k$ as the number of recovered lost marks, give the associated likelihood $\ell(N, p, q, r \mid n_1, n_2, m_2, k)$.

11 Tags can be lost by marked animals, but the animals themselves could also be lost to recapture either by changing habitat or dying. Our current model assumes that the population is closed; that is, that there is no immigration, emigration, birth, or death within the population during the length of the study. These other kinds of extension are dealt with in Sects. 5.3 and 5.5.

5.12 Show that the conditional distribution of $r_1$ in the open population model of Sect. 5.3 is proportional to the product (5.4).

5.13 Show that the distribution of $r_2$ in the open population model of Sect. 5.3 can be integrated out from the joint distribution and that this leads to the following distribution on $r_1$:

$$\pi(r_1 \mid p, q, n_1, c_2, c_3) \propto \frac{(n_1 - r_1)!\,(n_1 - r_1 - c_3)!}{r_1!\,(n_1 - r_1 - c_2)!} \left( \frac{q}{(1 - p)(1 - q)[q + (1 - p)(1 - q)]} \right)^{r_1}\,.$$

Compare the computational cost of a Gibbs sampler based on this approach with a Gibbs sampler using the full conditionals.


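5.14 Show that the likelihood associated with an open population as in Sect. 5.3 can be written as

$$\ell(N, p \mid \mathcal{D}^*) = \sum_{(\epsilon_{it}, \delta_{it})_{it}} \prod_{t=1}^{T} \prod_{i=1}^{N} q_{\epsilon_{i(t-1)}}^{\epsilon_{it}} (1 - q_{\epsilon_{i(t-1)}})^{1 - \epsilon_{it}} \times p^{(1 - \epsilon_{it})\delta_{it}} (1 - p)^{(1 - \epsilon_{it})(1 - \delta_{it})}\,,$$

where $q_0 = q$, $q_1 = 1$, and $\delta_{it}$ and $\epsilon_{it}$ are the capture and exit indicators, respectively. Derive the order of complexity of this likelihood; that is, the number of elementary operations necessary to compute it.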

5.15 In connection with the presentation of the accept-reject algorithm in Sect. 5.4, show that, for $M > 0$, if $g$ is replaced with $Mg$ in $\mathcal{S}_g$ and if $(X, U)$ is uniformly distributed on $\mathcal{S}_g$, the marginal distribution of $X$ is still $g$. Deduce that the density $g$ only needs to be known up to a normalizing constant.

5.16 For the function $g(x) = (1 + \sin^2(x))(2 + \cos^4(4x)) \exp[-x^4\{1 + \sin^6(x)\}]$ on $[0, 2\pi]$, examine the feasibility of running a uniform sampler on the set $\mathcal{S}_g$ associated with the accept-reject algorithm in Sect. 5.4.

5.17 Show that the probability of acceptance in Step 2 of Algorithm 5.9 is $1/M$ and that the number of trials until a variable is accepted has a geometric distribution with parameter $1/M$. Conclude that the expected number of trials per simulation is $M$.

5.18 For the conditional distribution of $\alpha_t$ derived from (5.3), construct an accept-reject algorithm based on a normal bounding density $f$ and study its performances for $N = 532$, $n_t = 118$, $\mu_t = -0.5$, and $\sigma^2 = 3$.

5.19 When uniform simulation on the accept-reject set $\mathcal{S}_g$ of Sect. 5.4 is impossible, construct a Gibbs sampler based on the conditional distributions of $u$ and $x$. (Hint: Show that both conditionals are uniform distributions.) This special case of the Gibbs sampler is called the slice sampler (see Robert and Casella, 2004, Chap. 8). Apply to the distribution of Exercise 5.16.

12 We will see in Chap. 7 a derivation of this likelihood that enjoys an O(T) complexity.

5.20 Show that the normalizing constant $M$ of a target density $f$ can be deduced from the acceptance rate in the accept-reject algorithm (Algorithm 5.9) under the assumption that $g$ is properly normalized.

5.21 Reproduce the analysis of Exercise 5.20 for the marginal distribution of $r_1$ computed in Exercise 5.13.

5.22 Modify the function ardipper used in Sect. 5.4 to return the acceptance rate as well as a sample from the target distribution.

5.23 Show that, given a mean and a 95% confidence interval in [0, 1], there exists at most one beta distribution $\mathcal{B}e(a, b)$ with such a mean and confidence interval.

5.24 Show that, for the Arnason-Schwarz model, groups of consecutive unknown locations are independent of one another, conditional on the observations. Devise a way to simulate these groups by blocks rather than one at a time; that is, using the joint posterior distributions of the groups rather than the full conditional distributions of the states.


I must have missed something.

Ian  Rankin,  The  Hanging  Garden. — 


Roadmap 

This chapter covers a class of models where a rather simple distribution is made more complex and less informative by a mechanism that mixes together several known or unknown distributions. This representation is naturally called a mixture of distributions, as illustrated above. Inference about the parameters of the elements of the mixtures and the weights is called mixture estimation, while recovery of the original distribution of each observation is called classification (or, more exactly, unsupervised classification to distinguish it from the supervised classification to be discussed in Chap. 8).

Both aspects almost always require advanced computational tools since even the representation of the posterior distribution may be complicated. Typically, Bayesian inference for these models was not correctly treated until the introduction of MCMC algorithms in the early 1990s. This chapter also covers the case of a mixture with an unknown number of components, for which a specific approximation of the Bayes factor was designed by Chib (1995).


J.-M. Marin and C.P. Robert, Bayesian Essentials with R, Springer Texts
in Statistics, DOI 10.1007/978-1-4614-8687-9_6

6.1 Missing Variable Models

In some cases, the complexity of a model originates from the fact that some
piece of information about an original and more standard (simpler) model is
missing. For instance, we have encountered a missing variable model in Chap. 5
with the Arnason-Schwarz model (Sect. 5.5), where not knowing the
characteristics of the individuals outside their capture periods makes inference
much harder. Similarly, we have seen in Chap. 4 that the probit model can
be reinterpreted as a missing-variable model in that we only observe the sign
of a normal variable.

Formally, all models that are defined via a marginalization mechanism,
that is, such that the density of the observables x, $f(x \mid \theta)$, is given by an
integral

$$f(x \mid \theta) = \int g(x, z \mid \theta)\,\mathrm{d}z\,, \tag{6.1}$$

can be considered as belonging to a missing variable (or missing data) model.

This chapter focuses on the case of the mixture model, which is the archetypical
missing-variable model in that its simple representation (and interpretation)
is mirrored by a need for complex processing. Later, in Chap. 7, we
will also discuss hidden Markov models that add to the missing structure a
temporal dependence dimension.


Although image analysis is the topic of Chap. 8, the dataset used in this
chapter is derived from an image of a license plate, called License and not
available in bayess, as

> image(license,col=grey(0:255/255),axes=FALSE,xlab="",
+ ylab="")

represented in Fig. 6.1 (top). The actual histogram of the grey levels is
concentrated on 256 values because of the poor resolution of the image, but we
transformed the original data as

> license=scan("license.txt")
> license=jitter(license,10)
> datha=log((license-min(license)+.01)/
+ (max(license)+.01-license))

where jitter is used to randomize the dataset and avoid repetitions (as
already described on page 156). The last line of code is a logit transform.


1 This is not a definition in the mathematical sense since all densities can formally
be represented that way. We thus stress that the model itself must be introduced
that way. This point is not to be mistaken for a requirement that the variable z be
meaningful for the data at hand. In many cases, for instance the probit model, the
missing variable representation remains formal.

The transformed data used in this chapter has been stored in the file
datha.txt.

> data(datha)
> datha=as.matrix(datha)
> hist(datha,nclass=200,xlab="",xlim=c(min(datha),max(datha)),
+ ylab="",prob=TRUE,main="")

As seen from Fig. 6.1 (bottom), the resulting structure of the data is compatible
with a sample from a mixture of several normal distributions (with at
least two components). We point out at this early stage that mixture modeling
is often used in image smoothing. Unsurprisingly, the current plate image
would instead require feature recognition, for which this modeling does not
help, because it requires spatial coherence and thus more complicated models
that will be presented in Chap. 8.

Fig. 6.1. Dataset License: (top) Image of a car license plate and (bottom) histogram
of the transformed grey levels of the dataset

6.2 Finite Mixture Models


We now introduce the specific case of mixtures as it exemplifies the complexity
of missing-variable models, both by its nature (in the sense that it is inherently
linked with a missing variable) and by its processing, which also requires the
incorporation of the missing structure.2

A mixture distribution is a convex combination

$$\sum_{j=1}^{k} p_j f_j(x)\,, \qquad p_j > 0\,, \quad \sum_{j=1}^{k} p_j = 1\,,$$

of k distributions $f_j$ (k > 1). In the simplest situations, the $f_j$'s are known and
inference focuses either on the unknown proportions $p_j$ or on the allocations
of the points of the sample $(x_1, \ldots, x_n)$ to the components $f_j$, i.e. on the
probability that $x_i$ is generated from $f_j$ as opposed to being generated
from $f_\ell$, say. In most cases, however, the $f_j$'s are from a parametric family
like the normal or Beta distributions, with unknown parameters $\theta_j$, leading
to the mixture model

$$\sum_{j=1}^{k} p_j f(x \mid \theta_j)\,, \tag{6.2}$$

with parameters including both the weights $p_j$ and the component parameters
$\theta_j$ ($j = 1, \ldots, k$). It is actually relevant to distinguish the weights $p_j$ from
the other parameters in that they are solely associated with the missing-data
structure of the model, while the others are related to the observations.
This distinction is obviously irrelevant in the computation of the likelihood
function or in the construction of the prior distribution, but it matters in the
interpretation of the posterior output, for instance.
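As an illustration of the definition above, the following R sketch evaluates the density of a normal mixture and simulates from it through the latent allocations. The function names dnormmix and rnormmix, as well as the numerical values of the weights, means, and standard deviations, are assumptions made for this example and are not part of bayess.

```r
# Density of a k-component normal mixture (hypothetical helper, not in bayess)
dnormmix <- function(x, p, mu, sigma) {
  # one column per component: p_j * f(x | theta_j)
  comp <- sapply(seq_along(p), function(j) p[j] * dnorm(x, mu[j], sigma[j]))
  rowSums(matrix(comp, nrow = length(x)))
}

# Simulation through the missing-variable representation:
# first draw the allocation z_i, then x_i from component z_i
rnormmix <- function(n, p, mu, sigma) {
  z <- sample(seq_along(p), n, replace = TRUE, prob = p)
  x <- rnorm(n, mean = mu[z], sd = sigma[z])
  list(x = x, z = z)
}

set.seed(1)
p <- c(0.7, 0.3); mu <- c(0, 2.5); sigma <- c(1, 1)   # illustrative values
sim <- rnormmix(1000, p, mu, sigma)
table(sim$z) / 1000                                   # close to the weights p
integrate(dnormmix, -Inf, Inf, p = p, mu = mu, sigma = sigma)$value
```

The allocations sim$z are returned only because the data are simulated; in a real mixture dataset they are precisely the missing part.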

There are several motivations for considering mixtures of distributions as
a useful extension to “standard” distributions. The most natural approach is
to envisage a dataset as made of several latent (that is, missing, unobserved)
strata or subpopulations. For instance, one of the earliest occurrences of mixture
modeling can be found in Bertillon (1887),3 where the bimodal structure
of the heights of (military) conscripts in central France (Doubs) can be
explained a posteriori by the aggregation of two populations of young men, one
from the plains and one from the mountains. The mixture structure appears
because the origin of each observation (that is, the allocation to a specific
subpopulation or stratum) is lost. In the example of the military conscripts,
this means that the geographical origin of each young man was not recorded.


2 We will see later that the missing structure of a mixture actually need not
be simulated but, for more complex missing-variable structures like hidden Markov
models (introduced in Chap. 7), this completion cannot be avoided.

3 The Frenchman Alphonse Bertillon is also the father of scientific police investigation.
For instance, he originated the use of fingerprints in criminal investigations.

Depending on the setting, the inferential goal associated with a sample
from a mixture of distributions may be either to reconstitute the groups by
estimating the missing component z, an operation usually called classification
(or clustering), to provide estimators for the parameters of the different
groups, or even to estimate the number k of groups.

A completely different (if more involved) approach to the interpretation
and estimation of mixtures is the semiparametric perspective. This approach
considers that since very few phenomena obey probability laws corresponding
to the most standard distributions, mixtures such as (6.2) can be seen as a
good trade-off between fair representation of the phenomenon and efficient
estimation of the underlying distribution. If k is large enough, there is
theoretical support for the argument that (6.2) provides a good approximation (in
some functional sense) to most distributions. Hence, a mixture distribution
can be perceived as a type of (functional) basis approximation of unknown
distributions, in a spirit similar to wavelets and splines, but with a more intuitive
flavor (for a statistician at least). However, this chapter mostly focuses on the
“parametric” case, that is, on situations when the partition of the sample into
subsamples with different distributions $f_j$ does make sense from the dataset
or modelling point of view (even though the computational processing is the
same in both cases). In other words, we consider settings where clustering the
sample into strata or subpopulations is of interest.

Let us consider an iid sample $\mathbf{x} = (x_1, \ldots, x_n)$ from model (6.2). The
likelihood is such that

$$\ell(\theta, p \mid \mathbf{x}) = \prod_{i=1}^{n} \sum_{j=1}^{k} p_j f(x_i \mid \theta_j)\,.$$


This likelihood contains $k^n$ terms when the inner sums are expanded. While
this expansion is not necessary for computing the likelihood at a given value
$(\theta, p)$, a computation that is feasible in O(nk) operations as demonstrated by
the representation in Fig. 6.2, it remains a necessary step in the understanding
of the mixture structure. Alas, the computational difficulty in using the
expanded version precludes analytic solutions for either maximum likelihood
or Bayes estimators.
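The O(nk) evaluation mentioned above can be sketched as follows for a normal mixture: an n × k matrix of the terms log{p_j f(x_i | θ_j)} is filled once, and each row is then summed on the log scale. The function name mixloglik and the log-sum-exp stabilization are choices made for this sketch rather than code taken from bayess.

```r
# Log-likelihood of a normal mixture at (p, mu, sigma) in O(nk) operations
mixloglik <- function(x, p, mu, sigma) {
  # n x k matrix with entries log{p_j f(x_i | theta_j)}
  lcomp <- sapply(seq_along(p), function(j)
    log(p[j]) + dnorm(x, mu[j], sigma[j], log = TRUE))
  lcomp <- matrix(lcomp, nrow = length(x))
  m <- apply(lcomp, 1, max)               # row maxima, to avoid underflow
  sum(m + log(rowSums(exp(lcomp - m))))   # sum over i of log sum over j
}

set.seed(2)
x <- c(rnorm(350, 2.5), rnorm(150, 0))    # simulated sample, illustrative only
mixloglik(x, p = c(0.7, 0.3), mu = c(2.5, 0), sigma = c(1, 1))
mixloglik(x, p = c(0.7, 0.3), mu = c(0, 2.5), sigma = c(1, 1))
```

The second call swaps the two means while keeping the weights fixed, which for this sample should return a lower value than the first call.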


Example 6.1. Consider the simple case of a two-component normal mixture

$$p\,\mathscr{N}(\mu_1, 1) + (1 - p)\,\mathscr{N}(\mu_2, 1)\,, \tag{6.3}$$

where the weight $p \ne 0.5$ is known. The likelihood surface can be computed
by an R code as in the following plotmix function, which relies on the image
function and a discretization of the $(\mu_1, \mu_2)$ space into pixels. Given a sample
sampl that is generated in the first lines of the function, the log-likelihood
surface is computed by

pbar=1-p
mu1=mu2=seq(min(sampl),max(sampl),.1) #grid over both means
mo1=mu1%*%t(rep(1,length(mu2))) #mo1[i,j]=mu1[i]
mo2=rep(1,length(mu2))%*%t(mu2) #mo2[i,j]=mu2[j]
ca1=-0.5*mo1*mo1
ca2=-0.5*mo2*mo2
like=0*mo1
for (i in 1:n)
  like=like+log(p*exp(ca1+sampl[i]*mo1)+
    pbar*exp(ca2+sampl[i]*mo2))
like=like+.1*(ca1+ca2)

and plotted by

image(mu1,mu2,like,xlab=expression(mu[1]),
  ylab=expression(mu[2]),col=heat.colors(250))
#nl sets the number of level sets drawn by contour
contour(mu1,mu2,like,levels=seq(min(like),max(like),length=nl),
  add=TRUE,drawlabels=FALSE)

We note that the outcome of the plotmix function is the list
list(sample=sampl,like=like), used in subsequent analyses of the data.
For instance, this outcome, including the level sets obtained by contour, is
provided in Fig. 6.2. In this case, the parameters are identifiable: $\mu_1$ cannot be
confused with $\mu_2$ when p is different from 0.5. Nonetheless, the log-likelihood
surface in this figure exhibits two modes, one being close to the true value
of the parameters used to simulate the dataset and one corresponding to an
inverse separation of the dataset into two groups.4 ◄

For any prior $\pi(\theta, p)$, the posterior distribution of $(\theta, p)$ is available up
to a multiplicative constant:

$$\pi(\theta, p \mid \mathbf{x}) \propto \left[\prod_{i=1}^{n} \sum_{j=1}^{k} p_j f(x_i \mid \theta_j)\right] \pi(\theta, p)\,.$$

While $\pi(\theta, p \mid \mathbf{x})$ can thus be computed for a given value of $(\theta, p)$ at a cost of
order O(kn), we now explain why the derivation of the posterior characteristics,
and in particular of posterior expectations of quantities of interest, is
only possible in an exponential time of order $O(k^n)$.

To explain this difficulty in more detail, we consider the rather intuitive
missing-variable representation of mixture models: With each $x_i$ is associated

4 To get a better understanding of this second mode, consider the limiting setting
when p = 0.5. In that case, there are two equivalent modes of the likelihood, $(\mu_1, \mu_2)$
and $(\mu_2, \mu_1)$. As p moves away from 0.5, this second mode gets lower and lower
compared with the other mode, but it still remains.

Fig. 6.2. R image representation of the log-likelihood of the mixture (6.3) for a
simulated dataset of 500 observations and a true value $(\mu_1, \mu_2, p) = (2.5, 0, 0.7)$.
Besides a mode (represented by a diamond) in the neighborhood of the true value,
the R contour function exhibits an additional mode on the likelihood surface


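a missing variable $z_i$ that indicates “its” component, i.e. the index $z_i$ of the
distribution from which it was generated. Formally, this means that we have
a hierarchical structure associated with the model:

$$z_i \mid p \sim \mathscr{M}_k(p_1, \ldots, p_k)$$

and

$$x_i \mid z_i, \theta \sim f(\cdot \mid \theta_{z_i})\,.$$

The completed likelihood corresponding to the missing structure is such that

$$\ell(\theta, p \mid \mathbf{x}, \mathbf{z}) = \prod_{i=1}^{n} p_{z_i} f(x_i \mid \theta_{z_i})$$

and

$$\pi(\theta, p \mid \mathbf{x}, \mathbf{z}) \propto \left[\prod_{i=1}^{n} p_{z_i} f(x_i \mid \theta_{z_i})\right] \pi(\theta, p)\,,$$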


where $\mathbf{z} = (z_1, \ldots, z_n)$. If we denote by $\mathcal{Z} = \{1, \ldots, k\}^n$ the set of the $k^n$
possible values of the vector z, we can decompose $\mathcal{Z}$ into a partition of subsets

$$\mathcal{Z} = \cup_{j=1}^{r} \mathcal{Z}_j$$

as follows (see Exercise 6.2 for the value of r): For a given allocation size vector
$(n_1, \ldots, n_k)$, where $n_1 + \cdots + n_k = n$, i.e. a given number of observations allocated
to each component, we define the partition sets

$$\mathcal{Z}_j = \left\{ \mathbf{z} :\ \sum_{i=1}^{n} \mathbb{I}_{z_i = 1} = n_1, \ldots, \sum_{i=1}^{n} \mathbb{I}_{z_i = k} = n_k \right\}\,,$$

which consist of all allocations with the given allocation size $(n_1, \ldots, n_k)$.
We label those partition sets with $j = j(n_1, \ldots, n_k)$ by using, for instance,
the lexicographical ordering on the $(n_1, \ldots, n_k)$'s. (This means that j = 1
corresponds to $(n_1, \ldots, n_k) = (n, 0, \ldots, 0)$, j = 2 to $(n_1, \ldots, n_k) = (n - 1, 1, \ldots, 0)$,
j = 3 to $(n_1, \ldots, n_k) = (n - 1, 0, 1, \ldots, 0)$, and so on.) Using this
partition, the posterior distribution of $(\theta, p)$ can be written in closed form as

$$\pi(\theta, p \mid \mathbf{x}) = \sum_{i=1}^{r} \sum_{\mathbf{z} \in \mathcal{Z}_i} \omega(\mathbf{z})\, \pi(\theta, p \mid \mathbf{x}, \mathbf{z})\,, \tag{6.5}$$

where $\omega(\mathbf{z})$ represents the marginal posterior probability of the allocation z
conditional on the observations x (derived by integrating out the parameters
$\theta$ and p). With this representation, a Bayes estimator of $(\theta, p)$ can also be
written in closed form as

$$\mathbb{E}^{\pi}[\theta, p \mid \mathbf{x}] = \sum_{i=1}^{r} \sum_{\mathbf{z} \in \mathcal{Z}_i} \omega(\mathbf{z})\, \mathbb{E}^{\pi}[\theta, p \mid \mathbf{x}, \mathbf{z}]\,.$$


Continuation of Example 6.1. In the special case of model (6.3), if we
take two different independent normal priors on both means,

$$\mu_1 \sim \mathscr{N}(0, 4)\,, \qquad \mu_2 \sim \mathscr{N}(2, 4)\,,$$

the posterior weight of a given allocation vector z is

$$\begin{aligned}
\omega(\mathbf{z}) \propto{} & \sqrt{(n_1 + 1/4)(n - n_1 + 1/4)}\; p^{n_1} (1 - p)^{n - n_1} \\
& \times \exp\{-[(n_1 + 1/4)\, s_1(\mathbf{z}) + n_1 \{\bar{x}_1(\mathbf{z})\}^2/4]/2\} \\
& \times \exp\{-[(n - n_1 + 1/4)\, s_2(\mathbf{z}) + (n - n_1)\{\bar{x}_2(\mathbf{z}) - 2\}^2/4]/2\}\,,
\end{aligned}$$

where

$$\bar{x}_1(\mathbf{z}) = \frac{1}{n_1} \sum_{i=1}^{n} \mathbb{I}_{z_i = 1}\, x_i\,, \qquad
\bar{x}_2(\mathbf{z}) = \frac{1}{n - n_1} \sum_{i=1}^{n} \mathbb{I}_{z_i = 2}\, x_i\,,$$

$$s_1(\mathbf{z}) = \sum_{i=1}^{n} \mathbb{I}_{z_i = 1} (x_i - \bar{x}_1(\mathbf{z}))^2\,, \qquad
s_2(\mathbf{z}) = \sum_{i=1}^{n} \mathbb{I}_{z_i = 2} (x_i - \bar{x}_2(\mathbf{z}))^2$$

(if we set $\bar{x}_1(\mathbf{z}) = 0$ when $n_1 = 0$ and $\bar{x}_2(\mathbf{z}) = 0$ when $n - n_1 = 0$).
Implementing this derivation in R is quite straightforward:

omega=function(z,x,p){
  n=length(x)
  n1=sum(z==1);n2=n-n1
  if (n1==0) xbar1=0 else xbar1=sum((z==1)*x)/n1
  if (n2==0) xbar2=0 else xbar2=sum((z==2)*x)/n2
  ss1=sum((z==1)*(x-xbar1)^2)
  ss2=sum((z==2)*(x-xbar2)^2)
  #last factor written as in the displayed formula, with prior mean 2 for mu2
  return(sqrt((n1+.25)*(n2+.25))*p^n1*(1-p)^n2*
    exp(-((n1+.25)*ss1+(n2+.25)*ss2)/2)*
    exp(-(n1*xbar1^2+n2*(xbar2-2)^2)/8))
}

leading for instance to

> omega(z=sample(1:2,4,rep=TRUE),
+ x=plotmix(n=4,plot=FALSE)$sample,p=.8)
[1] 0.0001781843
> omega(z=sample(1:2,4,rep=TRUE),
+ x=plotmix(n=4,plot=FALSE)$sample,p=.8)
[1] 5.152284e-09

Note  that  the  omega  function  is  not  and  cannot  be  normalized,  so  the  values 
must  be  interpreted  on  a  relative  scale.  ◄ 

The decomposition (6.5) makes a lot of sense from an inferential point of
view. The posterior distribution simply considers each possible partition z of
the dataset, then allocates a posterior probability $\omega(\mathbf{z})$ to this partition, and
at last constructs a posterior distribution for the parameters conditional on
this allocation. Unfortunately, the computational burden resulting from this
decomposition is simply too intensive because there are $k^n$ terms in the sum.
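To make the size of this sum concrete, the sketch below enumerates all 2^n allocations for a toy sample of n = 10 points from model (6.3) with p = 0.7, under the N(0, 4) and N(2, 4) priors of Example 6.1. The weights are obtained here from the marginal likelihood of each group, with the means integrated out, through a helper logmarg written for this example; the sample, the seed, and the helper are assumptions of the sketch. Doubling n to 20 would already require more than a million terms.

```r
# Log marginal likelihood of a group of N(mu, 1) observations
# when mu ~ N(m0, v0) is integrated out (hypothetical helper)
logmarg <- function(x, m0, v0) {
  n <- length(x)
  if (n == 0) return(0)
  xbar <- mean(x); s <- sum((x - xbar)^2)
  -0.5 * n * log(2 * pi) - 0.5 * log(1 + n * v0) -
    0.5 * (s + n * (xbar - m0)^2 / (1 + n * v0))
}

set.seed(42)
n <- 10; p <- 0.7
x <- c(rnorm(7, 2.5), rnorm(3, 0))              # toy sample, illustrative only

# all 2^n allocation vectors, one per row
Z <- as.matrix(expand.grid(rep(list(1:2), n)))
lw <- apply(Z, 1, function(z)
  sum(z == 1) * log(p) + sum(z == 2) * log(1 - p) +
    logmarg(x[z == 1], 0, 4) + logmarg(x[z == 2], 2, 4))
w <- exp(lw - max(lw)); w <- w / sum(w)         # normalized weights of the allocations
nrow(Z)                                         # number of terms in the sum

# conditional posterior means of mu1 and mu2 given z, mixed by the weights
e1 <- apply(Z, 1, function(z) sum(x[z == 1]) / (sum(z == 1) + 0.25))
e2 <- apply(Z, 1, function(z) (0.5 + sum(x[z == 2])) / (sum(z == 2) + 0.25))
c(mu1 = sum(w * e1), mu2 = sum(w * e2))
```

Since the weights are normalized over the complete enumeration, the last line is an exact Bayes estimate for this toy sample rather than a simulation-based approximation.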

However, there exists a solution that overcomes this computational problem.
It uses an MCMC approach that takes advantage of the missing-variable
structure and removes the requirement to explore the $k^n$ possible values of z
by only looking at the most likely ones.

Although this is beyond the scope of the book, let us point out here that
there also exists in the statistical literature a technique that predates MCMC
simulation algorithms but still relates to the same missing-data structure and
completion mechanism. It is called the EM Algorithm5 and consists of an
iterative but deterministic sequence of “E” (for expectation) and “M” (for
maximization) steps that converge to a local maximum of the likelihood. At
iteration t, the “E” step corresponds to the computation of the function

$$Q\{(\theta^{(t)}, p^{(t)}), (\theta, p)\} = \mathbb{E}_{(\theta^{(t)}, p^{(t)})}\left[\log \ell(\theta, p \mid \mathbf{x}, \mathbf{z}) \mid \mathbf{x}\right]\,,$$

5 In non-Bayesian statistics, the EM algorithm is certainly the most ubiquitous
numerical method, even though it only applies to (real or artificial) missing variable
models.

where the likelihood $\ell(\theta, p \mid \mathbf{x}, \mathbf{z})$ is the joint distribution of x and z, while
the expectation is computed under the conditional distribution of z given x
and the value $(\theta^{(t)}, p^{(t)})$ for the parameter. The “M” step corresponds to the
maximization of $Q\{(\theta^{(t)}, p^{(t)}), (\theta, p)\}$ in $(\theta, p)$, with solution $(\theta^{(t+1)}, p^{(t+1)})$.
As we will see in Sect. 6.4, the Gibbs sampler takes advantage of exactly the
same conditional distribution. Further details on EM and its Monte Carlo
versions (namely, when the “E” step is not analytically feasible) are given in
Robert and Casella (2004, Chap. 5; 2009, Chap. 5).
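For the mixture (6.3) with known weight p and unit variances, both steps are available in closed form, and a minimal EM sketch is given below. The function em_means, its starting values, and the simulated sample are assumptions made for illustration. The recorded log-likelihood should never decrease along the iterations, and the limit depends on the starting point since only a local maximum is reached.

```r
# EM for the two means of p N(mu1, 1) + (1 - p) N(mu2, 1), p known
em_means <- function(x, p, mu = mean(x) + c(1, -1), niter = 100) {
  loglik <- numeric(niter)
  for (t in seq_len(niter)) {
    w1 <- p * dnorm(x, mu[1])
    w2 <- (1 - p) * dnorm(x, mu[2])
    loglik[t] <- sum(log(w1 + w2))    # observed log-likelihood at current means
    # E step: probability that z_i = 1 given x_i and the current means
    g <- w1 / (w1 + w2)
    # M step: weighted averages maximize the expected completed log-likelihood
    mu <- c(sum(g * x) / sum(g), sum((1 - g) * x) / sum(1 - g))
  }
  list(mu = mu, loglik = loglik)
}

set.seed(3)
x <- c(rnorm(350, 2.5), rnorm(150, 0))   # simulated sample, illustrative only
fit <- em_means(x, p = 0.7)
fit$mu
range(diff(fit$loglik))                  # increments are nonnegative up to rounding
```

Starting the same function from mu = c(0, 2.5) is a simple way to check whether the iterations end up at the other local maximum.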
6.4 MCMC Solutions

For the joint distribution (6.4), the full conditional distribution of z given x
and the parameters is always available as

$$\pi(\mathbf{z} \mid \mathbf{x}, \theta, p) \propto \prod_{i=1}^{n} p_{z_i} f(x_i \mid \theta_{z_i})$$

and can thus be computed at a cost of O(n). Since, for standard distributions
$f(\cdot \mid \theta)$, the full posterior conditionals are also easily simulated when using
conjugate priors, this implies that the Gibbs sampler can be derived in this
setting.6

If p and $\theta$ are independent a priori, then, given z, the vectors p and x are
independent; that is,

$$\pi(p \mid \mathbf{z}, \mathbf{x}) \propto \pi(p) f(\mathbf{z} \mid p) f(\mathbf{x} \mid \mathbf{z}) \propto \pi(p) f(\mathbf{z} \mid p) \propto \pi(p \mid \mathbf{z})\,.$$

Moreover, in that case, $\theta$ is also independent a posteriori from p given z and
x, with density $\pi(\theta \mid \mathbf{z}, \mathbf{x})$. If we apply the Gibbs sampler in this problem, it
involves the successive simulation of z and $(p, \theta)$ conditional on one another
and on the data:


6 Historically, missing-variable models constituted one of the first instances where
the Gibbs sampler was used by completing the missing variables by simulation under
the name of data augmentation (see Tanner, 1996, and Robert and Casella, 2004,
Chaps. 9 and 10).

Algorithm 6.11 Mixture Gibbs Sampler

Initialization: Choose $p^{(0)}$ and $\theta^{(0)}$ arbitrarily.
Iteration t (t ≥ 1):

1. For i = 1, ..., n, generate $z_i^{(t)}$ such that

   $$P(z_i^{(t)} = j \mid \theta, p) \propto p_j^{(t-1)} f(x_i \mid \theta_j^{(t-1)})\,.$$

2. Generate $p^{(t)}$ according to $\pi(p \mid \mathbf{z}^{(t)})$.

3. Generate $\theta^{(t)}$ according to $\pi(\theta \mid \mathbf{z}^{(t)}, \mathbf{x})$.


The simulation of the $p_j$'s is also generally obvious since there exists a
conjugate prior (as detailed below). In contrast, the complexity in the simulation
of the $\theta_j$'s will depend on the type of sampling density $f(\cdot \mid \theta)$ as well as
the prior $\pi$.

The marginal (sampling) distribution of the $z_i$'s is a multinomial distribution
$\mathscr{M}_k(p_1, \ldots, p_k)$, which allows for a conjugate prior on p, namely the
Dirichlet distribution $p \sim \mathscr{D}(\gamma_1, \ldots, \gamma_k)$, with density

$$\frac{\Gamma(\gamma_1 + \cdots + \gamma_k)}{\Gamma(\gamma_1) \cdots \Gamma(\gamma_k)}\; p_1^{\gamma_1 - 1} \cdots p_k^{\gamma_k - 1}$$

on the simplex of $\mathbb{R}^k$,

$$\left\{ (p_1, \ldots, p_k) \in [0, 1]^k ;\ \sum_{j=1}^{k} p_j = 1 \right\}\,.$$

In this case, denoting $n_j = \sum_{l=1}^{n} \mathbb{I}_{z_l = j}$ ($1 \le j \le k$) the allocation sizes, the
posterior distribution of p given z is

$$p \mid \mathbf{z} \sim \mathscr{D}(n_1 + \gamma_1, \ldots, n_k + \gamma_k)\,.$$

It is rather peculiar that, despite its importance for Bayesian statistics,
the Dirichlet distribution is not available in R (at least in the standard stats
package). It is however fairly straightforward to code, using a representation
based on gamma variates, as shown below.

rdirichlet=function(n=1,par=rep(1,2)){
  k=length(par)
  mat=matrix(0,n,k)
  for (i in 1:n){
    sim=rgamma(k,shape=par,scale=1) #independent gamma variates
    mat[i,]=sim/sum(sim) #normalized to sum to one
  }
  mat
}

When the density $f(\cdot \mid \theta)$ also allows for conjugate priors, the simulation
of $\theta$ can be specified further since an independent conjugate prior on each $\theta_j$
leads to independent and conjugate posterior distributions on the $\theta_j$'s, given
z and x.
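The conjugate update of the weights can be checked by simulation. The sketch below repeats the rdirichlet function so that it runs on its own, draws from the Dirichlet posterior given some allocations z, and compares the simulated means with the exact posterior means (n_j + γ_j)/Σ_l (n_l + γ_l). The allocation vector and the choice γ_j = 1/2 are assumptions made for this example.

```r
# Dirichlet generator based on normalized gamma variates (as in the text)
rdirichlet <- function(n = 1, par = rep(1, 2)) {
  k <- length(par)
  mat <- matrix(0, n, k)
  for (i in 1:n) {
    sim <- rgamma(k, shape = par, scale = 1)
    mat[i, ] <- sim / sum(sim)
  }
  mat
}

set.seed(7)
gam <- c(0.5, 0.5, 0.5)                        # prior parameters gamma_j
z <- sample(1:3, 200, replace = TRUE,          # hypothetical allocations
            prob = c(0.5, 0.3, 0.2))
nj <- tabulate(z, nbins = 3)                   # allocation sizes n_j
psim <- rdirichlet(5000, par = nj + gam)       # draws from p | z

range(rowSums(psim))                           # each draw lies on the simplex
rbind(simulated = colMeans(psim),
      exact = (nj + gam) / sum(nj + gam))
```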

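Continuation of Example 6.1. For the mixture (6.3), under independent
normal priors $\mathscr{N}(\delta, 1/\lambda)$ (both $\delta \in \mathbb{R}$ and $\lambda > 0$ are fixed hyperparameters)
on both $\mu_1$ and $\mu_2$, the parameters $\mu_1$ and $\mu_2$ are independent given (z, x),
with conditional distributions

$$\mathscr{N}\left(\frac{\lambda\delta + n_1 \bar{x}_1(\mathbf{z})}{\lambda + n_1}, \frac{1}{\lambda + n_1}\right)
\quad\text{and}\quad
\mathscr{N}\left(\frac{\lambda\delta + (n - n_1)\bar{x}_2(\mathbf{z})}{\lambda + n - n_1}, \frac{1}{\lambda + n - n_1}\right)\,,$$

respectively. Similarly, the conditional posterior distribution of the $z_i$'s given
$(\mu_1, \mu_2)$ is (i = 1, ..., n)

$$P(z_i = 1 \mid \mu_1, x_i) \propto p \exp\{-0.5\,(x_i - \mu_1)^2\}\,.$$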

We can thus construct an R function like the following one to generate a
sample from the posterior distribution: assuming $\delta = 0$ and $\lambda = 1$,

gibbsmean=function(p,datha,niter=10^4){
  n=length(datha)
  z=rep(0,n); ssiz=rep(0,2)
  nxj=rep(0,2)
  mug=matrix(mean(datha),nrow=niter+1,ncol=2)
  for (i in 2:(niter+1)){
    for (t in 1:n){ #allocation of each observation
      prob=c(p,1-p)*dnorm(datha[t],mean=mug[i-1,])
      z[t]=sample(c(1,2),size=1,prob=prob)
    }
    for (j in 1:2){
      ssiz[j]=1+sum(z==j) #lambda+n_j
      nxj[j]=sum(as.numeric(z==j)*datha) #sum of the x_i's in group j
    }
    mug[i,]=rnorm(2,mean=nxj/ssiz,sd=sqrt(1/ssiz))
  }
  mug
}

which can be used as

> dat=plotmix()$sample
> simu=gibbsmean(0.7,dat)
> points(simu,pch=".")

to produce Figs. 6.3 and 6.4. This R code illustrates two possible behaviors
of this algorithm if we use a simulated dataset of 500 points from the mixture
$0.7\,\mathscr{N}(0, 1) + 0.3\,\mathscr{N}(2.5, 1)$, which corresponds to the level sets on both
pictures. The starting point in both cases is located at the saddle point between
the two modes, i.e. at an unstable equilibrium. Depending on the very
first (random) iterations of the algorithm, the final sample may end up located
on the upper or on the lower mode. For instance, in Fig. 6.3, the Gibbs
sample based on 10,000 iterations is in agreement with the likelihood surface,
since the second mode discussed in Example 6.1 is much lower than the mode
where the simulation output concentrates. In Fig. 6.4, the Gibbs sample ends
up being trapped by this lower mode. ◄

Fig. 6.3. Log-likelihood surface and the corresponding Gibbs sample for the model
(6.3), based on 10,000 iterations

Fig. 6.4. Same legend as Fig. 6.3, with the same starting point located at the saddle
point. In this instance, the Gibbs sample ends up around a lower mode

Example 6.2. If we consider the more general case of a mixture of two normal
distributions with all parameters unknown,

$$p\,\mathscr{N}(\mu_1, \sigma_1^2) + (1 - p)\,\mathscr{N}(\mu_2, \sigma_2^2)\,,$$

and for the conjugate prior distribution (j = 1, 2)

$$\mu_j \mid \sigma_j \sim \mathscr{N}(\xi_j, \sigma_j^2/\lambda_j)\,, \qquad
\sigma_j^2 \sim \mathscr{IG}(\nu_j/2, s_j^2/2)\,, \qquad
p \sim \mathscr{B}e(\alpha, \beta)\,,$$


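the same decomposition conditional on z and straightforward (if dreary)
algebra imply that

$$p \mid \mathbf{x}, \mathbf{z} \sim \mathscr{B}e(\alpha + n_1, \beta + n_2)\,,$$

$$\mu_j \mid \sigma_j, \mathbf{x}, \mathbf{z} \sim \mathscr{N}\left(\xi_j(\mathbf{z}), \frac{\sigma_j^2}{\lambda_j + n_j}\right)\,,$$

$$\sigma_j^2 \mid \mathbf{x}, \mathbf{z} \sim \mathscr{IG}\left((\nu_j + n_j)/2,\ s_j(\mathbf{z})/2\right)\,,$$

where $n_j$ is the number of $z_i$ equal to j, $\bar{x}_j(\mathbf{z})$ and $\hat{s}_j^2(\mathbf{z})$ are the empirical
mean and variance (biased) for the subsample with $z_i$ equal to j, and

$$\xi_j(\mathbf{z}) = \frac{\lambda_j \xi_j + n_j \bar{x}_j(\mathbf{z})}{\lambda_j + n_j}\,, \qquad
s_j(\mathbf{z}) = s_j^2 + n_j \hat{s}_j^2(\mathbf{z}) + \frac{\lambda_j n_j}{\lambda_j + n_j}\,(\xi_j - \bar{x}_j(\mathbf{z}))^2\,.$$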


The  modification  of  the  above  R  code  is  also  straightforward  and  we  do  not 
reproduce  it  here  to  save  space.  The  extension  to  more  than  two  components 
is  equally  straightforward,  as  described  below  for  License.  ◄ 
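One possible version of this modification is sketched below, under the simplifying assumption that both components share the same hyperparameters: μ_j | σ_j ~ N(xi, σ_j²/lambda), σ_j² ~ IG(nu/2, s2/2) and p ~ Be(alpha, beta). The function name gibbs2norm, the default hyperparameter values, the starting values, and the simulated dataset are all choices made for this illustration and are not the authors' code.

```r
gibbs2norm <- function(x, niter = 2000, xi = mean(x), lambda = 1,
                       nu = 2, s2 = var(x), alpha = 1, beta = 1) {
  n <- length(x)
  out <- matrix(0, niter, 5,
                dimnames = list(NULL, c("p", "mu1", "mu2", "sig1", "sig2")))
  # crude starting values (an assumption of this sketch)
  mu <- as.numeric(quantile(x, c(0.25, 0.75)))
  sig2 <- rep(var(x), 2); p <- 0.5
  for (t in 1:niter) {
    # 1. allocations z_i given (p, mu, sig2)
    w1 <- p * dnorm(x, mu[1], sqrt(sig2[1]))
    w2 <- (1 - p) * dnorm(x, mu[2], sqrt(sig2[2]))
    pr <- w1 / (w1 + w2); pr[is.na(pr)] <- 0.5   # guard against underflow
    z <- 2 - (runif(n) < pr)                     # z_i = 1 with probability pr
    nj <- c(sum(z == 1), sum(z == 2))
    # 2. weight p given z
    p <- rbeta(1, alpha + nj[1], beta + nj[2])
    # 3. variance, then mean, of each component given z and x
    for (j in 1:2) {
      xj <- x[z == j]
      xbar <- if (nj[j] > 0) mean(xj) else 0
      ss <- sum((xj - xbar)^2)                   # n_j times the biased variance
      sj <- s2 + ss + lambda * nj[j] / (lambda + nj[j]) * (xi - xbar)^2
      sig2[j] <- 1 / rgamma(1, shape = (nu + nj[j]) / 2, rate = sj / 2)
      mu[j] <- rnorm(1, (lambda * xi + nj[j] * xbar) / (lambda + nj[j]),
                     sqrt(sig2[j] / (lambda + nj[j])))
    }
    out[t, ] <- c(p, mu, sqrt(sig2))             # standard deviations are stored
  }
  out
}

set.seed(11)
x <- c(rnorm(300, 0, 1), rnorm(200, 4, 0.5))     # simulated data, illustrative only
res <- gibbs2norm(x)
colMeans(res[-(1:500), ])                        # averages after a burn-in of 500
```

With components this well separated the chain is not expected to switch labels, so the averages can be read component by component; this would not be safe for overlapping components.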
If we model License by a k = 3 component normal mixture model, we start
by deriving the prior distribution from the scale of the problem. Namely, we
choose a $\mathscr{D}(1/2, 1/2, 1/2)$ prior for the weights (although picking parameters
less than 1 in the Dirichlet prior has the potential drawback that it may
allow very small weights for some components), a $\mathscr{N}(\bar{x}, \hat\sigma^2)$ distribution on
the means $\mu_j$, and a $\mathscr{G}a(10, \hat\sigma^2)$ distribution on the precisions $\sigma_j^{-2}$, where $\bar{x}$
and $\hat\sigma^2$ are the empirical mean and variance of License, respectively. (This
empirical choice of a prior is debatable on principle, as it depends on the
dataset, but this is relatively harmless since it is equivalent to standardizing
the dataset so that the empirical mean and variance are equal to 0 and 1,
respectively.) If we define the parameter vector mix as a list,

> mix=list(k=k,p=p,mu=mu,sig=sig)

our R function

gibbsnorm=function(niter,mix)

is made of an initialization step:

n=length(datha);k=mix$k
z=rep(0,n) #missing data
nxj=rep(0,k)
ssiz=ssum=rep(0,k)
mug=sigg=prog=matrix(0,nrow=niter,ncol=k)
lopost=rep(0,niter) #log-posterior
lik=matrix(0,n,k)
prog[1,]=rep(1,k)/k;mug[1,]=rep(mix$mu,k)
sigg[1,]=rep(mix$sig,k)
#current log-likelihood
for (j in 1:k)
  lik[,j]=prog[1,j]*dnorm(x=datha,mean=mug[1,j],
    sd=sqrt(sigg[1,j]))
lopost[1]=sum(log(apply(lik,1,sum)))+
  sum(dnorm(mug[1,],mean(datha),sqrt(sigg[1,]),log=TRUE))-
  (10+1)*sum(log(sigg[1,]))-sum(var(datha)/sigg[1,])+
  .5*sum(log(prog[1,]))

and of the main loop for data completion and conditional parameter simulation:

for (i in 1:(niter-1)){
  for (t in 1:n){ #missing data completion
    prob=prog[i,]*dnorm(datha[t],mug[i,],sqrt(sigg[i,]))
    if (sum(prob)==0) prob=rep(1,k)/k
    z[t]=sample(1:k,1,prob=prob)
  }
  #conditional parameter simulation
  for (j in 1:k){
    ssiz[j]=sum(z==j)
    nxj[j]=sum(as.numeric(z==j)*datha)
  }
  mug[i+1,]=rnorm(k,(mean(datha)+nxj)/(ssiz+1),
    sqrt(sigg[i,]/(ssiz+1)))
  for (j in 1:k)
    ssum[j]=sum(as.numeric(z==j)*(datha-nxj[j]/ssiz[j])^2)
  sigg[i+1,]=1/rgamma(k,shape=.5*(20+ssiz),rate=var(datha)+
    .5*ssum+.5*ssiz/(ssiz+1)*(mean(datha)-nxj/ssiz)^2)
  prog[i+1,]=rdirichlet(1,par=ssiz+0.5)
  #current log-likelihood
  for (j in 1:k)
    lik[,j]=prog[i+1,j]*dnorm(x=datha,mean=mug[i+1,j],
      sd=sqrt(sigg[i+1,j]))
  lopost[i+1]=sum(log(apply(lik,1,sum)))+
    sum(dnorm(mug[i+1,],mean(datha),sqrt(sigg[i+1,]),log=TRUE))-
    (10+1)*sum(log(sigg[i+1,]))-sum(var(datha)/sigg[i+1,])+
    .5*sum(log(prog[i+1,]))
}

returning all simulated values as a list

list(k=k,mu=mug,sig=sigg,p=prog,lopost=lopost)

The output of this R function, represented in Fig. 6.5 as an overlay of the
License histogram, is then produced by the R code

mix=list(k=3,mu=mean(datha),sig=var(datha))
simu=gibbsnorm(1000,mix)
hist(datha,prob=TRUE,main="",xlab="",ylab="",nclass=100)
x=y=seq(min(datha),max(datha),length=150)
yy=matrix(0,ncol=150,nrow=1000)
for (i in 1:150){
  yy[,i]=apply(simu$p*dnorm(x[i],mean=simu$mu,
    sd=sqrt(simu$sig)),1,sum)
  y[i]=mean(yy[,i])
}
for (t in 501:1000)
  lines(x,yy[t,],col="gold")
lines(x,y,lwd=2.3,col="sienna2")

This output demonstrates that this crude prior modeling is sufficient to capture
the modal features of the histogram as well as the tail behavior in a
surprisingly small number of Gibbs iterations, despite the large sample size of
2,625 points. The range of the simulated densities represented in Fig. 6.5
reflects the variability of the posterior distribution, while the estimate of the
density is obtained by averaging the simulated densities over the 500 iterations.7


Fig.  6.5.  Dataset  License:  Representation  of  500  Gibbs  iterations  for  the  mixture 
estimation.  (The  accumulated  lines  correspond  to  the  estimated  mixtures  at  each 
iteration  and  the  overlaid  curve  to  the  density  estimate  obtained  by  summation.) 
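
The averaging step behind the overlaid curve can be isolated in a few lines.
The sketch below is only illustrative: the object `draws` holds made-up
parameter draws rather than the output of gibbsnorm, and the helper name
`mixdens` is an assumption introduced here, not part of the original code.

```r
# Hypothetical illustration: pointwise average of mixture densities over draws.
set.seed(1)
M <- 200                                   # number of retained iterations (assumed)
# Made-up "posterior draws" (rows = iterations, columns = components)
draws <- list(
  mu  = cbind(rnorm(M, 0, 0.05), rnorm(M, 3, 0.05)),
  sig = cbind(rgamma(M, 100, 100), rgamma(M, 100, 100)),
  p   = cbind(rep(0.4, M), rep(0.6, M))
)
# Density of the mixture defined by iteration t, evaluated on a grid
mixdens <- function(grid, t, d)
  sapply(grid, function(x)
    sum(d$p[t, ] * dnorm(x, mean = d$mu[t, ], sd = sqrt(d$sig[t, ]))))
grid <- seq(-4, 7, length = 150)
dens <- t(sapply(1:M, mixdens, grid = grid, d = draws))  # M x 150 matrix
est  <- colMeans(dens)                     # averaged density estimate
```

Each row of `dens` plays the role of one thin line in the figure, and `est` the
role of the thick overlaid curve.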


The  experiment  produced  in  Example  6.1,  page  184,  gives  a  false  sense 
of  security  about  the  performance  of  the  Gibbs  sampler  because  it  hides 
the  structural  dependence  of  the  sampler  on  its  initial  conditions.  The 
fundamental  feature  of  Gibbs  sampling — its  derivation  from  conditional 
distributions — implies  that  it  is  often  restricted  in  the  width  of  its  moves  and 
that,  in  some  situations,  this  restriction  may  even  jeopardize  convergence. 
In the case of mixtures of distributions, conditioning on z implies that the
proposals for $(\theta, p)$ are quite concentrated and do not allow drastic
changes in the allocations at the next step. Obtaining a significant modification of z
requires  a  considerable  number  of  iterations  once  a  stable  position  has  been 
reached.8  Figure  6.4  illustrates  this  phenomenon  for  the  very  same  sample 
as  in  Fig.  6.3:  A  Gibbs  sampler  initialized  at  the  saddlepoint  may  get  close 
to  the  second  mode  in  the  very  first  iterations  and  is  then  unable  to  escape 
its  (fatal)  attraction,  even  after  a  large  number  of  iterations,  for  the  reason 
given  above.  It  is  quite  interesting  to  see  that  this  Gibbs  sampler  suffers  from 
the  same  pathology  as  the  EM  algorithm.  However,  this  is  not  immensely 
surprising  given  that  it  is  based  on  a  similar  principle. 

In general, there is very little one can do about improving the Gibbs
sampler since its components are given by the joint distribution. The
solutions are (a) to change the parameterization and thus the conditioning (see
Exercise 6.6), (b) to use tempering to facilitate exploration (see Sect. 6.7), or
(c) to mix the Gibbs sampler with another MCMC algorithm.

7That this is a natural estimate of the model, compared with the “plug-in”
density using the estimates of the parameters, will be explained more clearly in
Sect. 6.5.

8In practice, the Gibbs sampler never leaves the vicinity of a given mode if the
attraction of this mode is strong enough, for instance in the case of many
observations.

To  look  for  alternative  MCMC  algorithms  is  not  a  difficulty  in  this  setting, 
given  that  the  likelihood  of  mixture  models  is  available  in  closed  form,  being 
computable  in  O (kn)  time,  and  the  posterior  distribution  is  thus  known  up 
to  a  multiplicative  constant.  We  can  therefore  use  any  Metropolis-Hastings 
algorithm,  as  long  as  the  proposal  distribution  q  provides  a  correct  exploration 
of  the  posterior  surface,  since  the  acceptance  ratio 


$$\frac{\pi(\theta',p'\mid x)}{\pi(\theta,p\mid x)}\;\frac{q(\theta,p\mid\theta',p')}{q(\theta',p'\mid\theta,p)}$$

can be computed in O(kn) time. For instance, we can use a random walk
Metropolis-Hastings algorithm where each parameter is the mean of the
proposal distribution for the new value, that is,

$$\tilde\xi_j = \xi_j^{(t-1)} + u_j,$$

where $u_j \sim \mathcal{N}(0, \zeta^2)$ and $\zeta$ is chosen to achieve a reasonable acceptance rate.

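Continuation of Example 6.1. For the posterior associated with (6.3), the
Gaussian random walk proposal is

$$\tilde\mu_1 \sim \mathcal{N}\big(\mu_1^{(t-1)}, \zeta^2\big) \quad\text{and}\quad \tilde\mu_2 \sim \mathcal{N}\big(\mu_2^{(t-1)}, \zeta^2\big),$$

which leads to an acceptance probability of

$$r = \min\left\{1,\; \pi(\tilde\mu_1,\tilde\mu_2 \mid x) \big/ \pi\big(\mu_1^{(t-1)},\mu_2^{(t-1)} \mid x\big)\right\}.$$
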

The  corresponding  R  function  is  then  of  the  form 

hmmean=function(dat,niter,var=1){
  mu=matrix(0,niter,2)
  mu[1,]=rnorm(2)
  for (i in 2:niter){
    muprop=rnorm(2,mu[i-1,],sqrt(var))
    bound=lpost(dat,muprop)-lpost(dat,mu[i-1,])
    if (runif(1)<=exp(bound)) mu[i,]=muprop else
      mu[i,]=mu[i-1,]
  }
  mu
}

used  as  in 


> dat=plotmix()$sample
> simu=hmmeantemp(dat,niter=10^4)
> points(simu,pch=".")


where lpost is the R function computing the log-posterior density:

lpost=function(x,mu,p=0.7,delta=0,lambda=1){
  sum(log(p*dnorm(x,mu[1])+(1-p)*dnorm(x,mu[2])))+
  sum(log(dnorm(mu,delta,1/sqrt(lambda))))
}

For the same simulated dataset as in Fig. 6.4, Fig. 6.6 shows how quickly this
algorithm escapes the attraction of the spurious mode. After a few iterations
of the algorithm, the chain drifts away from the poor mode and converges
almost deterministically to the proper region of the posterior surface.
The Gaussian random walk is scaled as $\zeta = 1$; slightly smaller scales
work as well but require more iterations to reach the proper modal
regions. Too small a scale sees the same trapping phenomenon appear, as the
chain does not have sufficient energy to escape the attraction of the current
mode (see Example 6.1, page 199, and Fig. 6.8 below). Nonetheless, for a large
enough scale, the Metropolis-Hastings algorithm overcomes the drawbacks of
the Gibbs sampler. ◄
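
To see how the scale drives the behavior of the chain, one can record the
acceptance rate for a few values of the scale. The following is a minimal
sketch under stated assumptions: the dataset is simulated here from
0.7 N(0,1) + 0.3 N(2.5,1) rather than produced by plotmix, the function
`accrate` is a hypothetical helper, and its `scale` argument is the standard
deviation of the random walk (not the variance used in hmmean).

```r
# Sketch: acceptance rate of a Gaussian random walk for several scales.
set.seed(42)
n   <- 500
lab <- rbinom(n, 1, 0.7)                        # assumed allocation weights
dat <- ifelse(lab == 1, rnorm(n, 0), rnorm(n, 2.5))
lpost <- function(x, mu, p = 0.7, delta = 0, lambda = 1) {
  sum(log(p * dnorm(x, mu[1]) + (1 - p) * dnorm(x, mu[2]))) +
    sum(dnorm(mu, delta, 1 / sqrt(lambda), log = TRUE))
}
accrate <- function(dat, niter, scale, start = c(1, 3)) {
  mu  <- start
  cur <- lpost(dat, mu)                         # log-posterior at current value
  acc <- 0
  for (i in 2:niter) {
    prop  <- rnorm(2, mu, scale)                # random walk proposal
    lprop <- lpost(dat, prop)
    if (log(runif(1)) <= lprop - cur) {         # accept on the log scale
      mu <- prop; cur <- lprop; acc <- acc + 1
    }
  }
  acc / (niter - 1)
}
rates <- sapply(c(0.01, 0.1, 1), function(s) accrate(dat, 2000, s))
```

A very small scale typically accepts most proposals but moves slowly, whereas
a large scale is rejected more often; the acceptance rate alone therefore
does not tell whether the chain has escaped a spurious mode.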


Fig. 6.6. Outcome of a 10,000 iteration random walk Metropolis-Hastings sample
on the log-likelihood surface; the starting point is equal to (1,3). The scale
$\zeta$ of the random walk is equal to 1



We must point out that, for constrained parameters, the unconstrained
random walk Metropolis-Hastings proposal remains valid but is not efficient
because when the chain $(\xi_j^{(t)})$ gets close to the boundary of the
parameter space, it moves very slowly, given that the proposed values are often
incompatible with the constraint and thus rejected at the Metropolis-Hastings
acceptance step.

For instance, this lack of efficiency has an impact on the simulation of
the weight vector $p$ since $\sum_{i=1}^k p_i = 1$ in addition to positivity
constraints. A practical resolution of this difficulty is to overparameterize
the weights of (6.2) into

$$p_j = w_j \Big/ \sum_{i=1}^k w_i, \qquad w_j > 0 \quad (1 \le j \le k).$$



Obviously, the $w_j$'s are not identifiable, but this is not a difficulty from a
simulation point of view and the $p_j$'s remain identifiable (up to a
permutation of indices). Perhaps paradoxically, using overparameterized
representations often helps with the mixing of the corresponding MCMC
algorithms since those algorithms are less constrained by the dataset or by the
likelihood. The reader may have noticed that the $w_j$'s are also constrained by
a positivity requirement (just like the variances in a normal mixture or the
scale parameters for a Gamma mixture), but this weaker constraint can be
bypassed using the reparameterization $\eta_j = \log w_j$. The proposed random
walk move on the $w_j$'s is thus

$$\log(\tilde w_j) = \log\big(w_j^{(t-1)}\big) + u_j,$$

where $u_j \sim \mathcal{N}(0, \zeta^2)$. An important difference from the
original random walk Metropolis-Hastings algorithm is that the acceptance ratio
also involves a Jacobian term. For instance, the acceptance ratio for a move
from $w^{(t-1)}$ to $\tilde w$ is then

$$1 \wedge \frac{\pi(\tilde w)}{\pi(w^{(t-1)})} \prod_{j=1}^k \frac{\tilde w_j}{w_j^{(t-1)}}. \qquad (6.6)$$
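
The role of the Jacobian term can be checked on a toy target. The sketch
below is an illustration under stated assumptions: `logrw_step` is a
hypothetical helper, and the independent Gamma(2, 1) target on the unnormalized
weights is chosen only because its mean is known; it is not the mixture
posterior.

```r
# Sketch: one random walk move on log(w) with the Jacobian-corrected ratio.
# ltarget is an (assumed) unnormalized log density of w = (w_1, ..., w_k).
logrw_step <- function(w, ltarget, scale) {
  wprop  <- exp(log(w) + rnorm(length(w), 0, scale))   # proposal stays positive
  lratio <- ltarget(wprop) - ltarget(w) +
            sum(log(wprop)) - sum(log(w))              # log of the Jacobian product
  if (log(runif(1)) <= lratio) wprop else w
}
# Toy check with independent Gamma(2, 1) targets on the w_j
set.seed(3)
ltarget <- function(w) sum(dgamma(w, shape = 2, rate = 1, log = TRUE))
w <- matrix(1, 5000, 3)
for (t in 2:5000) w[t, ] <- logrw_step(w[t - 1, ], ltarget, scale = 1)
wbar <- colMeans(w)      # should be in the vicinity of the Gamma(2, 1) mean, 2
p <- w / rowSums(w)      # induced weights; each row sums to one
```

Dropping the two `sum(log(...))` terms would make the chain target a different
distribution, which is a quick way to convince oneself that the Jacobian
matters.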


Note that, while being a fairly natural algorithm, the random walk
Metropolis-Hastings algorithm usually falls victim to the curse of
dimensionality since, obviously, the same scale cannot perform well for every
component of the parameter vector. In large or even moderate dimensions, a
reparameterization of the parameter and preliminary estimation of the
information matrix of the distribution are thus often necessary and must
sometimes be completed by Gibbs steps operating in lower dimensions.


6.5  Label  Switching  Difficulty 

A basic but extremely important feature of a mixture model is that it is
invariant under permutations of the indices of the components. For instance,
the normal mixtures $0.3\,\mathcal{N}(0,1) + 0.7\,\mathcal{N}(2.5,1)$ and
$0.7\,\mathcal{N}(2.5,1) + 0.3\,\mathcal{N}(0,1)$ are exactly the same.
Therefore, the $\mathcal{N}(2.5,1)$ distribution cannot be called the “first”
component of the mixture! In other words, the component parameters $\theta_i$
are not identifiable marginally in the sense that $\theta_1$ may be 2.5 as well
as 0 in the example above. In this specific case, the pairs $(\theta_1, p)$ and
$(\theta_2, 1-p)$ are exchangeable.

First, in a $k$-component mixture, the number of modes of the likelihood
is of order $O(k!)$ since if $((\theta_1,\ldots,\theta_k),(p_1,\ldots,p_k))$ is
a local maximum of the likelihood function, so is
$\tau(\theta,p) = (\theta_{\tau(1)},\ldots,\theta_{\tau(k)},p_{\tau(1)},\ldots,p_{\tau(k)})$
for every permutation $\tau \in \mathfrak{S}_k$, the set of all permutations of
$\{1,\ldots,k\}$. This makes maximization and even exploration of the posterior
surface obviously harder because modes are separated by valleys that most
samplers find difficult to cross.
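
The invariance is easy to verify numerically. The sketch below is a small
illustration with made-up data (30 and 70 points drawn from the two
components); the function name `loglik` is an assumption introduced here.

```r
# Sketch: the mixture log-likelihood is unchanged by relabeling components.
set.seed(7)
x <- c(rnorm(30, 0), rnorm(70, 2.5))   # 30 and 70 points from the two components
loglik <- function(x, mu, p)
  sum(log(sapply(x, function(xi) sum(p * dnorm(xi, mean = mu)))))
l1 <- loglik(x, mu = c(0, 2.5), p = c(0.3, 0.7))
l2 <- loglik(x, mu = c(2.5, 0), p = c(0.7, 0.3))   # labels swapped jointly
same <- isTRUE(all.equal(l1, l2))
# Swapping the means without the weights is a different parameter value
l3 <- loglik(x, mu = c(2.5, 0), p = c(0.3, 0.7))
```

Only a joint permutation of means and weights leaves the likelihood unchanged,
which is why each mode of the likelihood comes with its relabeled copies.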

Second, if an exchangeable prior is used on $(\theta, p)$ (that is, a prior
invariant under permutation of the indices), all the posterior marginals on the
$\theta_i$'s are identical, a fact which means for instance that the posterior
expectation of $\theta_1$ is identical to the posterior expectation of
$\theta_2$. Therefore, alternatives to posterior expectations must be
considered to provide pertinent estimators.

Continuation of Example 6.1. In the special case of model (6.3), if we
take the same normal prior on both $\mu_1$ and $\mu_2$,
$\mu_1, \mu_2 \sim \mathcal{N}(0, 10)$, say, the posterior weight conditional
on $p$ associated with an allocation $z$ for which $n_1$ values are attached to
the first component will simply be

$$
\begin{aligned}
\omega(z) &\propto p^{n_1}(1-p)^{n-n_1} \int e^{-n_1(\mu_1-\bar x_1(z))^2/2 - (n-n_1)(\mu_2-\bar x_2(z))^2/2}\, d\pi(\mu_1)\, d\pi(\mu_2) \\
&\qquad \times \exp\big(-\{s_1^2(z) + s_2^2(z)\}/2\big) \\
&\propto \frac{p^{n_1}(1-p)^{n-n_1}}{\sqrt{(n_1+1/10)(n-n_1+1/10)}}\, \exp\big(-\big\{s_1^2(z) + s_2^2(z) \\
&\qquad + n_1\{\bar x_1(z)\}^2/(10 n_1+1) + (n-n_1)\{\bar x_2(z)\}^2/(10(n-n_1)+1)\big\}/2\big),
\end{aligned}
$$

where $s_1^2(z)$ and $s_2^2(z)$ denote the sums of squares for both groups. ◄
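
As a hedged illustration, the weight of a given allocation can be evaluated on
the log scale. In the sketch below, `lweight` is a hypothetical helper, z is
coded as a vector of 1s and 2s, and the constants common to all allocations
are dropped; each Gaussian integral contributes a factor
$1/\sqrt{n_j + 1/10}$, hence the negative sign in front of the log terms.

```r
# Sketch: log of the unnormalized posterior weight of an allocation z
# under model (6.3) with independent N(0, 10) priors on the two means.
lweight <- function(z, x, p) {
  n  <- length(x); n1 <- sum(z == 1); n2 <- n - n1
  # group means (set to zero for an empty group) and sums of squares
  xb <- c(if (n1 > 0) mean(x[z == 1]) else 0,
          if (n2 > 0) mean(x[z == 2]) else 0)
  s2 <- c(sum((x[z == 1] - xb[1])^2), sum((x[z == 2] - xb[2])^2))
  n1 * log(p) + n2 * log(1 - p) -
    0.5 * (log(n1 + 0.1) + log(n2 + 0.1)) -
    0.5 * (sum(s2) + n1 * xb[1]^2 / (10 * n1 + 1) +
                     n2 * xb[2]^2 / (10 * n2 + 1))
}
# Example: compare two allocations of a small made-up sample
set.seed(1)
x  <- rnorm(6, 1)
z1 <- c(1, 1, 1, 2, 2, 2)
z2 <- c(1, 2, 1, 2, 1, 1)
ldiff <- lweight(z1, x, 0.7) - lweight(z2, x, 0.7)   # log ratio of the weights
```

Since the weights are only defined up to a constant, differences such as
`ldiff` are the meaningful quantities.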


For  the  Gibbs  output  of  License  discussed  above,  the  exchangeability 
predicted  by  the  theory  is  not  observed  at  all,  as  shown  in  Fig.  6.7.  This  figure 
is  derived  from  an  R  code  repeating  dual  plots  like 

> simu=gibbsnorm(1000,mix)
> plot(simu$mu[,1],ylim=range(simu$mu),
+   ylab=expression(mu[i]),xlab="n",type="l",col="sienna3")
> lines(simu$mu[,2],col="gold4")
> lines(simu$mu[,3],col="steelblue")
> plot(simu$mu[,2],simu$p[,2],col="sienna3",
+   xlim=range(simu$mu),ylim=range(simu$p),
+   xlab=expression(mu[i]),ylab=expression(p[i]))
> points(simu$mu[,3],simu$p[,3],col="steelblue")
In Fig. 6.7, we see that each component is thus identified by its mean, and
the posterior distributions of the means are very clearly distinct. Although this
result has the appeal of providing distinct estimates for the three components,
it suffers from the severe drawback that the Gibbs sampler has not explored
the whole parameter space after 1,000 iterations. Running the algorithm for
a much longer period does not solve this problem since the Gibbs sampler
cannot simultaneously switch enough component allocations in this highly
peaked setup. In other words, the algorithm is unable to explore more than
one of the 3! = 6 equivalent modes of the posterior distribution. Therefore, it
is difficult to trust the estimates derived from such an output.
Fig. 6.7. Dataset License: (left) Convergence of the three types of parameters of
the normal mixture, each component being identified by a different grey level / color;
(right) 2x2 plot of the Gibbs sample for the three types of parameters of a normal
mixture
This identifiability problem related to the exchangeability of the posterior
distribution, often called “label switching,” thus requires either an alternative
prior modeling or a more tailored inferential approach. A naive answer to
the problem is to impose an identifiability constraint on the parameters, for
instance defining the components by ordering the means (or the variances or
the weights) in a normal mixture (see Exercise 6.3). From a Bayesian point
of view, this amounts to truncating the original prior distribution, going from
$\pi(\theta, p)$ to

$$\pi(\theta, p)\, \mathbb{I}_{\mu_1 \le \cdots \le \mu_k}$$


for  instance.  While  this  seems  innocuous  (given  that  the  sampling  distribution 
is  the  same  with  or  without  this  indicator  function),  the  introduction  of  an 
identifiability  constraint  has  severe  consequences  on  the  resulting  inference, 
both  from  a  prior  and  from  a  computational  point  of  view.  When  reducing 
the  parameter  space  to  its  constrained  part,  the  imposed  truncation  has  no 
reason  to  respect  the  topology  of  either  the  prior  or  the  likelihood.  Instead  of 
singling  out  one  mode  of  the  posterior,  the  constrained  parameter  space  may 
then  well  include  parts  of  several  modes  and  the  resulting  posterior  mean 
could,  for  instance,  lie  in  a  very  low  probability  region  between  the  modes, 
while  the  high  posterior  probability  zones  are  located  at  the  boundaries  of 
this  space. 

In addition, the constraint may radically modify the prior modeling and
come close to contradicting the prior information. For large values of $k$, the
introduction of a constraint also has a consequence on posterior inference:
With many components, the ordering of components in terms of one of the
parameters of the mixture is unrealistic. Some components will be close in
mean while others will be close in variance or in weight. This may even lead to
very poor estimates of the parameters if the inappropriate ordering is chosen.

Note that, while imposing a constraint that is not directly related to the
modal regions of the target distribution may considerably reduce the efficiency
of an MCMC algorithm, the constraint does not need to be imposed during the
simulation: it can instead be imposed after simulation by reordering the MCMC
output according to the constraint. For instance, if the constraint imposes an
ordering of the means, once the simulation is over, the components can be
relabeled for each MCMC iteration according to this constraint; that is,
defining the first component as the one associated with the smallest simulated
mean and so on. From this perspective,
identifiability constraints have nothing to do with (or against) simulation.
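
A post-simulation reordering of this kind takes only a few lines. The
following sketch assumes the MCMC output is stored as M x k matrices named
mu, sig, and p (made-up values are used here), and `relabel_by_mean` is a
hypothetical helper name.

```r
# Sketch: impose mu_1 <= ... <= mu_k after simulation by permuting each row.
relabel_by_mean <- function(mu, sig, p) {
  ord  <- t(apply(mu, 1, order))                  # row-wise sorting permutation
  idx  <- cbind(as.vector(row(mu)), as.vector(ord))
  pick <- function(a) matrix(a[idx], nrow(a), ncol(a))  # a[i, ord[i, j]]
  list(mu = pick(mu), sig = pick(sig), p = pick(p))
}
# Made-up draws standing in for an MCMC output
set.seed(5)
M   <- 4
mu  <- matrix(rnorm(M * 3), M, 3)
sig <- matrix(rexp(M * 3), M, 3)
p   <- matrix(runif(M * 3), M, 3); p <- p / rowSums(p)
out <- relabel_by_mean(mu, sig, p)
```

The same permutation is applied to the means, variances, and weights of a
given iteration, so that each relabeled component keeps its own parameters.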

An empirical resolution of the label switching problem that avoids imposing
the constraints altogether consists of arbitrarily selecting one of the $k!$
modal regions of the posterior distribution once the simulation step is over
and only then relabeling in terms of proximity to this region.

Given an MCMC sample of size $M$, we can find a Monte Carlo approximation
of the maximum a posteriori (MAP) estimator by taking
$(\theta^{(i^*)}, p^{(i^*)})$ such that

$$i^* = \arg\max_{i=1,\ldots,M} \pi\{(\theta, p)^{(i)} \mid x\},$$

that is, the simulated value that gives the maximal posterior density. (Note
that $\pi$ does not need its normalizing constant for this computation.) This
value is quite likely to be in the vicinity of one of the $k!$ modes, especially if
we run many simulations. The approximate MAP estimate will thus act as a
pivot in the sense that it gives a good approximation to a mode and we can
reorder the other iterations with respect to this mode.

Rather than selecting the reordering based on a Euclidean distance in
the parameter space, we use a distance in the space of allocation
probabilities. Indeed, the components of the parameter vary in different spaces,
from the real line for the means to the simplex for the weights. Let
$\mathfrak{S}_k$ be the $k$-permutation set and $\tau \in \mathfrak{S}_k$. We
suggest minimizing in $\tau$ an entropy distance summing the relative entropies
between the $P(z_t = j \mid \theta^{(i^*)}, p^{(i^*)})$'s and the
$P(z_t = j \mid \tau\{(\theta^{(i)}, p^{(i)})\})$'s, namely

$$h(i,\tau) = \sum_{t=1}^n \sum_{j=1}^k P\big(z_t = j \mid \theta^{(i^*)}, p^{(i^*)}\big) \log\left\{ \frac{P\big(z_t = j \mid \theta^{(i^*)}, p^{(i^*)}\big)}{P\big(z_t = j \mid \tau\{(\theta^{(i)}, p^{(i)})\}\big)} \right\}.$$

The  selection  of  the  permutations  reordering  the  MCMC  output  thus  reads 
as  follows: 


Algorithm 6.12 Pivotal Reordering
At iteration $i \in \{1,\ldots,M\}$:

1. Compute

   $$\tau_i = \arg\min_{\tau\in\mathfrak{S}_k} h(i,\tau),$$

2. Set $(\theta^{(i)}, p^{(i)}) = \tau_i\{(\theta^{(i)}, p^{(i)})\}$.


Thanks to this reordering, most iteration labels get switched to the same
mode (when $n$ gets large, this is almost a certainty), and the identifiability
problem is thus solved. Therefore, after this reordering step, the Monte Carlo
estimate of the posterior expectation $\mathbb{E}^\pi[\theta_i \mid x]$,

$$\sum_{j=1}^M (\theta_i)^{(j)} \big/ M,$$

can be used as in a standard setting because the reordering automatically
gives different meanings to different components. Obviously,
$\mathbb{E}^\pi[\theta_i \mid x]$ (or its approximation) should also be compared
with $\theta^{(i^*)}$ to check convergence.9

Using the Gibbs output simu of License (which is the datha of the
following code) as in the previous illustration, the corresponding R code involves
the determination of the MAP approximation

indimap=order(simu$lopost,decreasing=TRUE)[1]
map=list(mu=simu$mu[indimap,],sig=simu$sig[indimap,],
  p=simu$p[indimap,])

that  is  easily  derived  by  storing  the  values  of  the  log-likelihood  densities  in 
the  Gibbs  sampling  function  gibbsnorm.  The  corresponding  (MAP)  allocation 
probabilities  for  the  data  are  then 

lili=alloc=matrix(0,length(datha),3)
for (t in 1:length(datha)){
  lili[t,]=map$p*dnorm(datha[t],mean=map$mu,
    sd=sqrt(map$sig))
  lili[t,]=lili[t,]/sum(lili[t,])
}

They  are  used  as  reference  for  the  reordering: 

ormu=orsig=orp=matrix(0,ncol=3,nrow=1000)
library(combinat)
perma=permn(3)
n=length(datha)
for (t in 1:1000){
  entropies=rep(0,factorial(3))
  for (j in 1:n){
    alloc[j,]=simu$p[t,]*dnorm(datha[j],mean=simu$mu[t,],
      sd=sqrt(simu$sig[t,]))
    alloc[j,]=alloc[j,]/sum(alloc[j,])
    for (i in 1:factorial(3))
      entropies[i]=entropies[i]+
        sum(lili[j,]*log(alloc[j,perma[[i]]]))
  }
  best=order(entropies,decreasing=TRUE)[1]
  ormu[t,]=simu$mu[t,perma[[best]]]
  orsig[t,]=simu$sig[t,perma[[best]]]
  orp[t,]=simu$p[t,perma[[best]]]
}


9While  this  resolution  seems  intuitive  enough,  there  is  still  a  lot  of  debate  in 
academic  circles  on  whether  or  not  label  switching  should  be  observed  on  an  MCMC 
output  and,  in  case  it  should,  on  which  substitute  to  the  posterior  mean  should  be 
used. 
An output comparing the original MCMC sample and the one corresponding
to this reordering for the License dataset is then constructed. However,
since the Gibbs sampler does not switch between the $k!$ modes in this case,
the above reordering does not modify the labelling and we thus abstain from
producing the corresponding graph as it is identical to Fig. 6.7.


6.6 Prior Selection10

After insisting in Chap. 2 that conjugate priors are not the only possibility
for prior modeling, we seem to be using them quite extensively in this chapter!
The fundamental reason for this is that, as explained below, it is not possible
to use the standard alternative of noninformative priors on the components.
Nonconjugate priors can be used as well (with Metropolis-Hastings steps) but
are difficult to fathom when the components have no specific “real” meaning
(as, for instance, when the mixture is used as a nonparametric proxy).

The representation (6.2) of a mixture model precludes the use of
independent improper priors,

$$\pi(\theta) = \prod_{j=1}^k \pi_j(\theta_j),$$

since if, for any $1 \le j \le k$,

$$\int \pi_j(\theta_j)\, d\theta_j = \infty,$$

then, for every $n$,

$$\int \pi(\theta, p \mid x)\, d\theta\, dp = \infty.$$

The reason for this inconvenient behavior is that among the $k^n$ terms in the
expansion (6.5) of $\pi(\theta, p \mid x)$, there are $(k-1)^n$ terms without any
observation allocated to the $j$th component and thus there are $(k-1)^n$ terms
with a conditional posterior $\pi(\theta_j \mid x, z)$ that is equal to the
prior $\pi_j(\theta_j)$.
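
The count of allocations leaving a given component empty can be confirmed by
brute force for small values. This is only a sanity check under the
assumption that allocations are coded as vectors in {1, ..., k}; the values
k = 3 and n = 5 are arbitrary.

```r
# Sketch: enumerate all k^n allocations and count those leaving component 1 empty.
k <- 3; n <- 5
allz   <- as.matrix(expand.grid(rep(list(1:k), n)))   # one allocation per row
empty1 <- sum(rowSums(allz == 1) == 0)                # component 1 receives no point
counts <- c(total = nrow(allz), empty = empty1, formula = (k - 1)^n)
```

The proportion of such allocations, $((k-1)/k)^n$, decreases with n but never
vanishes, which is why the divergence occurs for every sample size.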

The inability to use improper priors may be seen by some as a marginalia,
a fact of little importance, since they argue that proper priors with large
variances can be used instead. However, since mixtures are ill-posed
problems,11 this difficulty with improper priors is more of an issue, given that
the influence of a particular proper prior, no matter how large its variance,
cannot be truly assessed. In other words, the prior gives a specific meaning to
what distinguishes one component from another.

10This section may be skipped by most readers, as it only addresses the very
specific issue of handling improper priors in mixture estimation.

11By nature, ill-posed problems are not precisely defined. They cover classes of
models such as inverse problems, where the complexity of getting back from the
data to the parameters is huge. They are not to be confused with nonidentifiable
problems, though.

Prior distributions must always be chosen with the utmost care when dealing
with mixtures and their bearing on the resulting inference assessed by a
sensitivity study. The fact that some noninformative priors are associated with
undefined posteriors, no matter what the sample size, is a clear indicator of the
complex nature of Bayesian inference for those models.


6.7  Tempering 


The notion of tempering can be found in different areas under many different
denominations, but it always comes down to the same intuition that governs
simulated annealing (Chap. 8), namely that when you flatten a posterior
surface, it is easier to move around, while if you sharpen it, it gets harder to
do so except around peaks.

More formally, given a density $\pi(x)$, we can define an associated density
$\pi_\alpha(x) \propto \pi(x)^\alpha$ for $\alpha > 0$ large enough (if $\alpha$
is too small, $\pi(x)^\alpha$ does not integrate). An important property of this
family of distributions is that they all share the same modes. When
$\alpha > 1$, the surface of $\pi_\alpha$ is more contrasted than the surface of
$\pi$: Peaks are higher and valleys are lower. Increasing $\alpha$ to infinity
results in a Dirac mass at the modes of $\pi$, and this is the principle behind
simulated annealing. Conversely, lowering $\alpha$ to values less than 1 makes
the surface smoother by lowering peaks and raising valleys. In a compact
space, lowering $\alpha$ to 0 ends up with the uniform distribution.
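
The flattening effect can be visualized on a grid. The sketch below uses a toy
bimodal density (an assumption, unrelated to the License posterior) and
measures the contrast between the highest peak and a point located between the
modes; the names `temper` and `contrast` are hypothetical.

```r
# Sketch: effect of the power alpha on a bimodal density evaluated on a grid.
grid <- seq(-6, 10, length = 2001)
dx   <- grid[2] - grid[1]
target <- function(x) 0.3 * dnorm(x, 0) + 0.7 * dnorm(x, 4)   # toy density
temper <- function(alpha) {
  d <- target(grid)^alpha
  d / (sum(d) * dx)                       # renormalize numerically on the grid
}
# Ratio of the highest peak to the value at x = 2, between the two modes
contrast <- function(d) max(d) / d[which.min(abs(grid - 2))]
cs <- sapply(c(2, 1, 0.1), function(a) contrast(temper(a)))
```

The contrast shrinks toward 1 as the power decreases, while the location of
the highest peak on the grid stays the same.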

This rather straightforward intuition can be exploited in several directions
for simulation. For instance, a tempered version of $\pi$, $\pi_\alpha$, can be
simulated in a preliminary step to determine where the modal regions of $\pi$
are. (Different values of $\alpha$ can be used in parallel to compare the
results.) This preliminary exploration can then be used to build a more
appropriate proposal. Alternatively, these simulations may be pursued and
associated with appropriate importance weights. Note also that a regular
Metropolis-Hastings algorithm may be used with $\pi_\alpha$ just as well as with
$\pi$ since the acceptance ratio is transformed into

$$\left(\frac{\pi(\theta',p'\mid x)}{\pi(\theta,p\mid x)}\right)^{\alpha} \frac{q(\theta,p\mid\theta',p')}{q(\theta',p'\mid\theta,p)} \wedge 1 \qquad (6.7)$$

in the case of the mixture parameters, with the same irrelevance of the
normalizing constants.
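
The importance weighting of tempered simulations can be sketched on a toy case
where everything is known. The example below is an assumption-laden
illustration: the target is a standard normal, so that the tempered density
with power alpha is a normal with variance 1/alpha and can be sampled exactly
instead of by MCMC.

```r
# Sketch: reweighting draws from pi_alpha to estimate an expectation under pi.
set.seed(11)
alpha <- 0.5
lpi   <- function(x) -x^2 / 2                 # toy unnormalized log target, N(0,1)
xs    <- rnorm(1e5, 0, sqrt(1 / alpha))       # exact draws from pi_alpha
lw    <- (1 - alpha) * lpi(xs)                # log weights, log(pi / pi_alpha) up to a constant
w     <- exp(lw - max(lw)); w <- w / sum(w)   # self-normalized weights
est   <- sum(w * xs^2)                        # estimate of E[X^2] under pi (true value 1)
ess   <- 1 / sum(w^2)                         # effective sample size
```

With MCMC draws from the tempered target, the same weights apply, but a small
effective sample size would signal that the power is too far from 1 for the
reweighting to be reliable.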

Fig. 6.8. Comparison of Metropolis-Hastings samples of $10^4$ points started in the
vicinity of the spurious mode for the target distributions $\pi_\alpha$ when
$\alpha = 1, 0.1, 0.01$ (from left to right), $\pi$ is the same as in Fig. 6.6,
and the proposal is a random walk with variance 0.1 (the shape of the
log-likelihood does not change)

Continuation of Example 6.1. If we consider once more the posterior
associated with (6.3), we can check in Fig. 6.8 the cumulative effect of a small
variance for the random walk proposal (chosen here as 0.1) and a decrease in
the power $\alpha$. The R function used to produce this figure is

hmmeantemp=function(dat,niter=100,var=.1,alpha=1){
  mu=matrix(0,niter,2)
  mu[1,]=c(1,3)
  for (i in 2:niter){
    muprop=rnorm(2,mu[i-1,],sqrt(var))
    bound=lpost(dat,muprop)-lpost(dat,mu[i-1,])
    if (runif(1)<=exp(alpha*bound)) mu[i,]=muprop else
      mu[i,]=mu[i-1,]
  }
  mu
}

It thus constitutes a very straightforward modification of the original
Metropolis-Hastings algorithm. For the genuine target distribution $\pi$ (left),
10,000 iterations of the Metropolis-Hastings algorithm are not nearly
sufficient to remove the attraction of the lower mode. When $\alpha = 0.1$, we
can reasonably hope that a few thousand more iterations could bring the Markov
chain toward the other mode. For $\alpha = 0.01$, only a few iterations suffice
to switch modes, given that the saddle between both modes is not much lower
than the modes themselves. (The best way to check this fact and to select
$\alpha$ in practice is to run the R code!) ◄
6.8 Mixtures with an Unknown Number of Components

While the standard interpretation of mixtures gives each component a
meaning, the semiparametric approach to mixtures only perceives
components as base elements in a representation of an unknown density. In that
perspective, the number $k$ of components represents the degree of
approximation, and it has no particular reason to be fixed in advance. Even
from the traditional perspective, it may also happen that the number of
homogeneous groups within the population of interest is unknown and that
inference first seeks to determine this number. For instance, in a marketing
study of Web-browsing behaviors, it may well be that the number of different
behaviors is unknown. Also, for instance, in the analysis of financial stocks,
the number of different patterns in the evolution of these stocks may be
unknown to the analyst. For these different situations, it is thus necessary to
extend the previous setting to include inference on the number $k$ of
components itself.

Inference on such a structure is somewhat more complicated than on single
models, especially when there are an infinite number of submodels, i.e. when
$k$ is not bounded, and it can be tackled from two different (or even opposite)
perspectives. The first approach is to consider the variable dimension model
as a whole and to estimate quantities that are meaningful for the whole model
(such as moments or predictives) as well as quantities that only make sense
for submodels (such as posterior probabilities of submodels and posterior
moments of $\theta_k$). From a Bayesian perspective, once a prior is defined on
$\theta$, the only difficulty is in finding an efficient way to explore the
complex parameter space in order to produce these estimators. The second
perspective on variable dimension models is to resort to testing, rather than
estimation, by adopting a model choice stance. This requires choosing among all
possible submodels the “best one” in terms of an appropriate criterion, usually
through the Bayes factor (Sect. 2.3.2). The computational resolution of the
comparison when the number of models is infinite requires MCMC exploration,
while the variability of the resulting inference may be underestimated if the
selection of the model is not accounted for in the assessment of the
variability. Nonetheless, this is an approach often used in linear and
generalized linear models (Chaps. 3 and 4) where subgroups of covariates are
compared against a given dataset.

Mixtures with an unknown number of components are one particular
instance of variable dimension models. Other cases include the selection of
covariates among $k$ possible covariates in a generalized linear model (Chap. 4)
which can be seen as a collection of $2^k$ submodels (depending on the presence
or absence of each covariate). Similarly, in a time series model such as the AR
and MA models (Chap. 7), the value of the lag dependence can be left open,
depending on the data at hand. Other instances are the determination of the
order in a hidden Markov model (Chap. 7), as in DNA sequences where the
dependence of the past bases may go back for one, two, or more steps, or even
in a capture-recapture experiment (Chap. 5) when one estimates the number
of species from the observed species.
While we opt here for a testing perspective, a more generic simulation
technique called reversible jump has been developed by Green (1995). While
it was presented in the earlier edition (Marin and Robert, 2007), it requires
both a high degree of formalization and a very sensitive calibration. In the
specific case of mixtures, the number of models under comparison (i.e., the
range of $k$) is usually small enough to prefer an enumeration of all models and
hence an approximation of all marginal likelihoods.

Similarly, Dirichlet processes are often advanced as an alternative to the
estimation of the number of components for mixtures because they naturally
embed a clustering mechanism. A Dirichlet process is a nonparametric object
that formally involves a countably infinite number of components. Nonetheless,
inference on Dirichlet processes for a finite sample size produces a random
number of clusters, which can be used as an estimate of the number of components.
Since the technical complexity of those objects is too high for this
book, we refer to Hjort et al. (2010) for details.

Once testing is adopted as the setting of reference, the implementation of
the principle boils down to studying some proposals regarding approximations
of the Bayes factor oriented towards the direct exploitation of outputs from
single model MCMC runs.

In fact, the major difference between approximations of Bayes factors
based on those outputs and approximations based on the output from the
reversible jump chains is that the latter requires a sufficiently efficient choice
of proposals to move around models, which can be difficult. If we can instead
concentrate the simulation effort on single models, the complexity of the
algorithm decreases (a lot) and there exist ways to evaluate the performance of
the corresponding MCMC samples. In addition, it is often the case that few
models are in competition when estimating $k$ and it is therefore possible to
visit the whole range of potential models in an exhaustive manner.

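We have

$$f_J(x|\lambda_J) = \prod_{i=1}^{n} \sum_{j=1}^{J} p_j\, f(x_i|\theta_j)\,,$$

where $\lambda_J = (\theta, p) = (\theta_1,\ldots,\theta_J,p_1,\ldots,p_J)$. Most solutions (see, e.g.,
Frühwirth-Schnatter, 2006, Sect. 5.4) revolve around an importance sampling
approximation to the marginal likelihood integral

$$m_J(x) = \int f_J(x|\lambda_J)\,\pi_J(\lambda_J)\,\text{d}\lambda_J\,,$$

where $J$ denotes the model index (that is, the number of components in the
present case). A different possibility is to use the Gelfand and Dey (1994)
representation: starting from an arbitrary density $g_J$, the equality

$$1 = \int g_J(\lambda_J)\,\text{d}\lambda_J
   = \int \frac{g_J(\lambda_J)}{f_J(x|\lambda_J)\,\pi_J(\lambda_J)}\, f_J(x|\lambda_J)\,\pi_J(\lambda_J)\,\text{d}\lambda_J
   = m_J(x) \int \frac{g_J(\lambda_J)}{f_J(x|\lambda_J)\,\pi_J(\lambda_J)}\, \pi_J(\lambda_J|x)\,\text{d}\lambda_J$$

implies that a potential estimate of $m_J(x)$ is

$$\hat{m}_J(x) = 1 \Big/ \frac{1}{T} \sum_{t=1}^{T} \frac{g_J(\lambda_J^{(t)})}{f_J(x|\lambda_J^{(t)})\,\pi_J(\lambda_J^{(t)})}\,,$$

when the $\lambda_J^{(t)}$'s are produced by a Monte Carlo or an MCMC sampler targeted
at $\pi_J(\lambda_J|x)$.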
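
As a minimal sketch of this estimate, the following R code applies the Gelfand and Dey (1994) representation to a toy conjugate model (normal observations with unit variance and a N(0,1) prior on the mean) rather than to a mixture, so that the exact marginal likelihood is available for comparison. The model, the variable names, and the choice of $g$ as a normal density centered at the posterior mean with half the posterior standard deviation are assumptions made for this illustration only.

```r
# Toy model (assumed for illustration): x_i ~ N(theta, 1), theta ~ N(0, 1)
set.seed(1)
n <- 20
x <- rnorm(n, mean = 1)

# Conjugate posterior: theta | x ~ N(sum(x) / (n + 1), 1 / (n + 1))
post_mean <- sum(x) / (n + 1)
post_sd <- 1 / sqrt(n + 1)

# Exact log marginal from Bayes' theorem, evaluated at theta = 0
log_m_exact <- sum(dnorm(x, 0, 1, log = TRUE)) + dnorm(0, 0, 1, log = TRUE) -
  dnorm(0, post_mean, post_sd, log = TRUE)

# Draws from the posterior (exact here; an MCMC sample in general)
n_sim <- 1e4
theta <- rnorm(n_sim, post_mean, post_sd)

# Auxiliary density g with lighter tails than the posterior
log_g <- dnorm(theta, post_mean, post_sd / 2, log = TRUE)
log_lik <- vapply(theta, function(th) sum(dnorm(x, th, 1, log = TRUE)), numeric(1))
log_prior <- dnorm(theta, 0, 1, log = TRUE)

# log of the average of g / (f * pi), computed with the log-sum-exp trick
log_ratio <- log_g - log_lik - log_prior
shift <- max(log_ratio)
log_m_gd <- -(shift + log(mean(exp(log_ratio - shift))))

c(exact = log_m_exact, gelfand_dey = log_m_gd)
```

Taking $g$ with thinner tails than the posterior keeps the ratio $g/(f\pi)$ bounded in this toy case, which is what makes the average well behaved.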

While this solution can be easily implemented in low dimensional settings,
calibrating the auxiliary density $g_J$ is always an issue. The auxiliary density
could be selected as a non-parametric estimate of $\pi_J(\lambda_J|x)$ based on the
sample itself but this is very costly. Another difficulty is that the estimate
may have an infinite variance and thus be too variable to be trustworthy.

Yet another approximation to the integral $m_J(x)$ is to consider it as the
expectation of $f_J(x|\lambda_J)$, when $\lambda_J$ is distributed from the prior. A brute
force approach that simulates $\lambda_J$ from the prior distribution is possible,
but it requires a huge number of simulations!
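
The cost of the brute force approach can be seen on a small example. The sketch below uses the same kind of toy conjugate model (normal observations with unit variance, N(0,1) prior on the mean), which is an assumption of this illustration and not the mixture setting; the effective sample size `ess` is a standard importance-sampling diagnostic added here for illustration.

```r
# Toy model (assumed for illustration): x_i ~ N(theta, 1), theta ~ N(0, 1)
set.seed(2)
n <- 20
x <- rnorm(n, mean = 1)

# Exact log marginal, available here by conjugacy
post_mean <- sum(x) / (n + 1)
post_sd <- 1 / sqrt(n + 1)
log_m_exact <- sum(dnorm(x, 0, 1, log = TRUE)) + dnorm(0, 0, 1, log = TRUE) -
  dnorm(0, post_mean, post_sd, log = TRUE)

# Brute force: average the likelihood over draws from the prior
n_sim <- 1e5
theta <- rnorm(n_sim)
log_lik <- vapply(theta, function(th) sum(dnorm(x, th, 1, log = TRUE)), numeric(1))
shift <- max(log_lik)
w <- exp(log_lik - shift)
log_m_prior <- shift + log(mean(w))

# Effective sample size: most prior draws carry a negligible likelihood
ess <- sum(w)^2 / sum(w^2)

c(exact = log_m_exact, brute_force = log_m_prior, ess = ess)
```

Even in this one-dimensional case only a fraction of the prior draws contributes to the average; the loss worsens as the posterior concentrates or the dimension grows.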

We consider here a further solution, first proposed by Chib (1995), that
is straightforward to implement in the setting of mixtures. Although this
method may fail because of the lack of label switching, we show below how
the difficulty can easily be removed. Chib's method is directly based on the
expression of the marginal distribution (loosely called marginal likelihood in
this section) in Bayes' theorem:

$$m_J(x) = \frac{f_J(x|\lambda_J)\,\pi_J(\lambda_J)}{\pi_J(\lambda_J|x)}\,,$$

and on the property that the rhs of this equation is constant in $\lambda_J$. Therefore,
if an arbitrary value of $\lambda_J$, $\lambda_J^*$ say, is selected and if a good approximation to
$\pi_J(\lambda_J|x)$ can be constructed, $\hat{\pi}_J(\lambda_J|x)$, Chib's approximation to the marginal
likelihood is

$$\hat{m}_J(x) = \frac{f_J(x|\lambda_J^*)\,\pi_J(\lambda_J^*)}{\hat{\pi}_J(\lambda_J^*|x)}\,. \qquad (6.8)$$
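
The fact that the right-hand side of Bayes' identity does not depend on the evaluation point can be checked numerically. The sketch below does so on a toy conjugate normal model where the posterior density is known exactly; the model and the helper `log_chib` are assumptions of this illustration, whereas in the mixture case the posterior density has to be replaced by an estimate.

```r
# Toy model (assumed for illustration): x_i ~ N(theta, 1), theta ~ N(0, 1)
set.seed(3)
n <- 15
x <- rnorm(n, mean = 0.5)
post_mean <- sum(x) / (n + 1)
post_sd <- 1 / sqrt(n + 1)

# log of f(x | theta*) pi(theta*) / pi(theta* | x) for a given theta*
log_chib <- function(theta_star) {
  sum(dnorm(x, theta_star, 1, log = TRUE)) +
    dnorm(theta_star, 0, 1, log = TRUE) -
    dnorm(theta_star, post_mean, post_sd, log = TRUE)
}

# The same value is returned whatever the evaluation point
vals <- vapply(c(-1, 0, post_mean, 2), log_chib, numeric(1))
vals
```

With an estimated posterior density the choice of the evaluation point does matter, since the estimate is most reliable in regions of high posterior density.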


In the case of mixtures, a natural approximation to $\pi_J(\lambda_J|x)$ is the
Rao-Blackwell estimate

$$\hat{\pi}_J(\lambda_J^*|x) = \frac{1}{T} \sum_{t=1}^{T} \pi_J(\lambda_J^*|x, z^{(t)})\,, \qquad (6.9)$$

where the $z^{(t)}$'s are the latent variables simulated by the MCMC sampler. To
be efficient, this method requires

(a) a good choice of $\lambda_J^*$ but, since in the case of mixtures, the likelihood is
computable, $\lambda_J^*$ can be chosen as the MCMC approximation to the MAP
estimator (see Algorithm 6.12) and,

(b) a good approximation to $\pi_J(\lambda_J|x)$.


This latter requirement is paramount: while, at a formal level, $\hat{\pi}_J(\lambda_J^*|x)$ is
a converging (parametric) approximation to $\pi_J(\lambda_J|x)$ by virtue of the ergodic
theorem, this obviously requires the chain $(z^{(t)})$ to converge to its stationary
distribution. Unfortunately, as discussed previously, in the case of
mixtures, the Gibbs sampler rarely converges because of the label switching
phenomenon, so the approximation $\hat{\pi}_J(\lambda_J^*|x)$ is untrustworthy. It is easily seen
via a numerical experiment that (6.8) is significantly different from the true
value $m_J(x)$ when label switching does not occur. There is, however, a fix to
this problem which is to recover the label switching symmetry a posteriori,
replacing $\hat{\pi}_J(\lambda_J^*|x)$ in (6.9) above with

$$\tilde{\pi}_J(\lambda_J^*|x) = \frac{1}{T\,J!} \sum_{\sigma\in\mathfrak{S}_J} \sum_{t=1}^{T} \pi_J(\sigma(\lambda_J^*)|x, z^{(t)})\,,$$

where $\mathfrak{S}_J$ denotes the set of all permutations of $\{1,\ldots,J\}$ and $\sigma(\lambda_J^*)$ denotes
the transform of $\lambda_J^*$ where components are switched according to the permutation
$\sigma$. Note that the permutation can equally be applied to $\lambda_J^*$ or to the
$z^{(t)}$'s but that the former is usually more efficient from a computational point
of view given that the sufficient statistics only have to be computed once. The
justification for this modification stems from a Rao-Blackwellization argument,
namely that the permutations are ancillary for the problem and should
be integrated out.

Example 6.3. In the case of the normal mixture and a benchmark called
the "galaxy dataset" (Robert and Casella, 2004, Chap. 11, Table 11.1), Gibbs
sampling does not produce any label switching. If we compute $\log \hat{m}_J(x)$ using
Chib's original estimate (6.8), the [logarithm of the] estimated marginal
likelihood is

$$\hat{\rho}_J(x) = -105.1396$$

for $J = 3$ (based on $10^3$ simulations), while introducing the permutations
leads to

$$\hat{\rho}_J(x) = -103.3479\,.$$

As noted by Frühwirth-Schnatter (2006), the difference between the original
Chib's approximation and the true marginal likelihood is close to $\log(J!)$
(only) when the Gibbs sampler remains concentrated around a single mode
of the posterior distribution. In the current case, we have that

$$-116.3747 + \log(2!) = -115.6816$$

exactly! (We also checked this numerical value of the marginal likelihood
against a brute-force estimate obtained by simulating from the prior and
averaging the likelihood, up to a fourth digit agreement.) A similar result
holds for $J = 3$, with

$$-105.1396 + \log(3!) = -103.3479\,.$$

For $J = 4$, we get for instance that the original Chib's approximation is
$-104.1936$, while the average over permutations gives $-102.6642$. Similarly,
for $J = 5$, the difference between $-103.91$ and $-101.93$ is less than $\log(5!)$. The
$\log(J!)$ difference cannot therefore be used as a direct correction for Chib's
approximation because of this difficulty in controlling the amount of overlap.
However, it is unnecessary since using the permutation average resolves the
difficulty. Table 6.1 shows that the preferred value of $J$ for the Galaxy dataset
and the current choice of prior distribution is $J = 5$.


J                       2        3        4        5        6        7        8
$\hat{\rho}_J(x)$   -115.68  -103.35  -102.66  -101.93  -102.88  -105.48  -108.44

Table 6.1. Dataset Galaxy: estimations of the marginal log-likelihoods by the
symmetrized Chib's approximation
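
The entries of Table 6.1 can be turned into posterior model probabilities once a prior on $J$ is chosen. The sketch below assumes a uniform prior over $J = 2,\ldots,8$ (an assumption of this illustration, not a choice stated in the text) and also reproduces the $\log(J!)$ arithmetic of Example 6.3; the variable names are hypothetical.

```r
# Estimated log marginal likelihoods copied from Table 6.1
J <- 2:8
log_m <- c(-115.68, -103.35, -102.66, -101.93, -102.88, -105.48, -108.44)

# Posterior probabilities of each J under a uniform prior (assumption),
# normalized on the log scale to avoid underflow
post <- exp(log_m - max(log_m))
post <- post / sum(post)
names(post) <- J
round(post, 3)

# Preferred number of components
best_J <- J[which.max(log_m)]
best_J

# log(J!) shifts reported in Example 6.3 for J = 2 and J = 3
shift2 <- -116.3747 + lfactorial(2)
shift3 <- -105.1396 + lfactorial(3)
c(shift2, shift3)

# Gaps for J = 4 and J = 5 compared with log(J!)
c(gap4 = -102.6642 + 104.1936, bound4 = lfactorial(4),
  gap5 = -101.93 + 103.91, bound5 = lfactorial(5))
```

Working with differences of log marginal likelihoods before exponentiating keeps the computation stable even when the values are of order $-100$ or lower.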


When the number of components $J$ grows too large for all permutations
in $\mathfrak{S}_J$ to be considered in the average, a (random) subsample of permutations
can be simulated to keep the computing time to a reasonable level (obviously
keeping the identity as one of the selected permutations!), as in Table 6.1 for
$J = 6, 7$. Note also that the discrepancy between the original Chib's (1995)
approximation and the average over permutations is a good indicator of the
mixing properties of the Markov chain, if a further convergence indicator is
requested.
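
A random subsample of permutations that always contains the identity can be drawn with base R only. The helper `sample_perms` below is hypothetical and is not part of the code accompanying the book; drawing distinct permutations is a design choice of this sketch.

```r
# Hypothetical helper: n_perm distinct permutations of 1:J, identity first
sample_perms <- function(J, n_perm) {
  stopifnot(n_perm >= 1, n_perm <= factorial(J))
  perms <- matrix(1:J, nrow = 1)          # the identity is always kept
  while (nrow(perms) < n_perm) {
    cand <- sample(J)                     # uniform random permutation of 1:J
    seen <- any(apply(perms, 1, function(p) all(p == cand)))
    if (!seen) perms <- rbind(perms, cand)
  }
  unname(perms)
}

set.seed(42)
perms <- sample_perms(7, 100)
dim(perms)
perms[1, ]
```

Each row of `perms` can then play the role of one $\sigma$ in the symmetrized average, with the normalizing factor $J!$ replaced by the number of sampled permutations.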

We implemented Chib's method for the License dataset in the function
gibbsnorm(niter,mix). The code relies on the combinatorial package
combinat in order to store all possible permutations:

lolik=rep(0,niter)
library(combinat)
perms=matrix(unlist(permn(k)),ncol=k,byrow=T)
nperms=dim(perms)[1]

The marginal likelihood is then averaged over iterations and permutations

chibdeno=0
for (j in 1:nperms)
  chibdeno=chibdeno+exp(sum(dnorm(mug[i+1,perms[j,]],
    mean=(mean(datha)+nxj)/(1+ssiz),
    sd=sqrt(sigg[i+1,perms[j,]])/sqrt((1+ssiz)),log=TRUE))+
    sum(dgamma(1/sigg[i+1,perms[j,]],shape=.5*(20+ssiz),
    rate=var(datha)+.5*ssum+.5*ssiz/
    (ssiz+1)*(mean(datha)-nxj/ssiz)^2,log=TRUE)
    -2*log(sigg[i+1,perms[j,]]))+
    sum((ssiz-0.5)*log(prog[i+1,perms[j,]]))+
    lgamma(sum(ssiz+0.5))-sum(lgamma(ssiz+0.5)))

the function returning a list list(...,lolik=lolik,deno=chibdeno). Using
the code,

> simu=gibbsnorm(1000,mix)
> lopos=order(simu$lopost)[1000]
> lmim1=simu$lolik[lopos]
> lmim2=sum(dnorm(simu$mu[lopos,],
+ mean=mean(datha),sd=simu$sig[lopos,],log=TRUE)+
+ dgamma(1/simu$sig[lopos,],10,var(datha),log=TRUE)-
+ 2*log(simu$sig[lopos,]))+
+ sum((rep(0.5,k)-1)*log(simu$p[lopos,]))+
+ lgamma(sum(rep(0.5,k)))-sum(lgamma(rep(0.5,k)))
> lchibapprox2=lmim1+lmim2-log(simu$deno)

we obtain Table 6.2 which gives the approximations of the marginal
likelihoods from $k = 2$ to $k = 8$. For the License dataset, the favored number of
components is thus $k = 4$.


k                        2          3          4         5         6
$\hat{\rho}_k(x)$   -5373.445  -5315.351  -5308.79  -5336.23  -5341.524

Table 6.2. Dataset License: estimations of the marginal log-likelihoods by the
symmetrized Chib's approximation


6.9  Exercises 

6.1  Show  that  a  mixture  of  Bernoulli  distributions  is  again  a  Bernoulli  distribution. 
Extend  this  to  the  case  of  multinomial  distributions. 

6.2 Show that the number of nonnegative integer solutions of the decomposition of $n$
into $k$ parts such that $n_1 + \ldots + n_k = n$ is equal to

$$r = \binom{n+k-1}{n}\,.$$

Deduce that the number of partition sets is of order $O(n^{k-1})$. (Hint: This is a classical
combinatoric problem.)

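6.3 For a mixture of two normal distributions with all parameters unknown,

$$p\,\mathcal{N}(\mu_1,\sigma_1^2) + (1-p)\,\mathcal{N}(\mu_2,\sigma_2^2)\,,$$

and for the prior distribution ($j = 1, 2$)

$$\mu_j|\sigma_j \sim \mathcal{N}(\xi_j, \sigma_j^2/n_j)\,,\qquad \sigma_j^2 \sim \mathcal{IG}(\nu_j/2, s_j^2/2)\,,\qquad p \sim \mathcal{B}e(\alpha,\beta)\,,$$

show that

$$p|x,z \sim \mathcal{B}e(\alpha+\ell_1, \beta+\ell_2)\,,$$

$$\mu_j|\sigma_j,x,z \sim \mathcal{N}\left(\xi_j(z), \frac{\sigma_j^2}{n_j+\ell_j}\right)\,,\qquad \sigma_j^2|x,z \sim \mathcal{IG}\left((\nu_j+\ell_j)/2, s_j(z)/2\right)\,,$$

where $\ell_j$ is the number of $z_i$ equal to $j$, $\bar{x}_j(z)$ and $\hat{s}_j^2(z)$ are the empirical mean and
variance for the subsample with $z_i$ equal to $j$, and

$$\xi_j(z) = \frac{n_j\xi_j + \ell_j\bar{x}_j(z)}{n_j+\ell_j}\,,\qquad
s_j(z) = s_j^2 + \ell_j\hat{s}_j^2(z) + \frac{n_j\ell_j}{n_j+\ell_j}\,(\xi_j - \bar{x}_j(z))^2\,.$$

Compute the corresponding weight $\omega(z)$.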


6.4 For the normal mixture model of Exercise 6.3, compute the function $Q(\theta_0,\theta)$ and
derive both steps of the EM algorithm. Apply this algorithm to a simulated dataset and
test the influence of the starting point $\theta_0$.

6.5 In the mixture model with independent priors on the $\theta_j$'s, show that the $\theta_j$'s are
dependent on each other given (only) $x$ by summing out the $z$'s.

6.6 Construct and test the Gibbs sampler associated with the $(\xi,\mu_0)$ parameterization
of (6.3), when $\mu_1 = \mu_0 - \xi$ and $\mu_2 = \mu_0 + \xi$.

6.7 Show that, if an exchangeable prior $\pi$ is used on the vector of weights $(p_1,\ldots,p_k)$,
then, necessarily, $\mathbb{E}^\pi[p_j] = 1/k$ and, if the prior on the other parameters $(\theta_1,\ldots,\theta_k)$ is
also exchangeable, then $\mathbb{E}^\pi[p_j|x_1,\ldots,x_n] = 1/k$ for all $j$'s.

6.8 Show that running an MCMC algorithm with target $\pi(\theta|x)^\gamma$ will increase the
proximity to the MAP estimate when $\gamma > 1$ is large. (Note: This is a crude version of
the simulated annealing algorithm. See also Chap. 8.) Discuss the modifications required
in Algorithm 6.11 to achieve simulation from $\pi(\theta|x)^\gamma$ when $\gamma \in \mathbb{N}^*$ is an integer.

6.9 Show that the ratio (6.7) goes to 1 when $\alpha$ goes to 0 when the proposal $q$ is a
random walk. Describe the average behavior of this ratio in the case of an independent
proposal.

6.10 If one needs to use importance sampling weights, show that the simultaneous
choice of several powers $\alpha$ requires the computation of the normalizing constant of $\pi_\alpha$.

6.11 In the setting of the mean mixture (6.3), run an MCMC simulation experiment
to compare the influence of a $\mathcal{N}(0,100)$ and of a $\mathcal{N}(0,10000)$ prior on $(\mu_1,\mu_2)$ on a
sample of 500 observations.

6.12 Show that, for a normal mixture $0.5\,\mathcal{N}(0,1) + 0.5\,\mathcal{N}(\mu,\sigma^2)$, the likelihood is
unbounded. Exhibit this feature by plotting the likelihood of a simulated sample using
the R image procedure.

Rebus was intrigued by the long gaps

in  the  chronology. 

— Ian  Rankin,  The  Falls. — 


Roadmap 

At one point or another, everyone has to face modeling time series datasets, by
which we mean series of dependent observations that are indexed by time (like
both series in the picture above!). As in the previous chapters, the difficulty in
modeling such datasets is to balance the complexity of the representation of the
dependence structure against the estimation of the corresponding model, and
thus the modeling most often involves model choice or model comparison. We
cover here the Bayesian processing of some of the most standard time series models,
namely the autoregressive and moving average models, as well as extensions
that are more complex to handle like stochastic volatility models used in finance.

This chapter also covers the more complex dependence structure found in hidden
Markov models, while spatial dependence is considered in Chap. 8. The reader
should be aware that, due to mathematical constraints related to the long-term
stability of the series, this chapter contains more advanced material, although we
refrained from introducing complex simulation procedures on variable-dimension
spaces.


J.-M. Marin and C.P. Robert, Bayesian Essentials with R, Springer Texts
in Statistics, DOI 10.1007/978-1-4614-8687-9_7

7.1 Time-Indexed Data

While we started with independent (and even iid) observations, for the obvious
reason that they are easier to process, we soon departed from this setup, gathering
more complexity either through heterogeneity as in the linear and generalized
linear models (Chaps. 3 and 4), or through some dependence structure
as in the open capture-recapture models of Chap. 5 that pertain to the generic
notion of hidden Markov models covered in Sect. 7.5.

7.1.1  Setting 

This  chapter  concentrates  on  time-series  (or  dynamic )  models,  which  somehow 
appear  to  be  simpler  because  they  are  unidimensional  in  their  dependence, 
being  indexed  only  by  time.  Their  mathematical  validation  and  estimation 
are  however  not  so  simple,  while  they  are  some  of  the  most  commonly  used 
models  in  applications,  ranging  from  finance  and  economics  to  reliability,  to 
medical  experiments,  and  ecology.  This  is  the  case,  for  instance,  for  series 
of  pollution  data,  such  as  ozone  concentration  levels,  or  stock  market  prices, 
whose  value  at  time  t  depends  at  least  on  the  previous  value  at  time  t  —  1. 

The dataset we use in this chapter is a collection of four time series connected
with the stock market. Figure 7.1 plots the successive values from January 1,
1998, to November 9, 2003, of those four stocks,¹ which are the first
ones (in alphabetical order) to appear in the financial index Eurostoxx50, a
financial reference for the euro zone² made of 50 major stocks. These four
series constitute the Eurostoxx50 dataset. A perusal of these graphs is sufficient
for rejecting the assumption of independence of these series: High values
are followed by high values and small values by small values, even though the
variability (or volatility) of the stocks varies from share to share.


The simplest mathematical structure for a time series is when the series
$(x_t)$ is Markov. We recall that a stochastic process $(x_t)_{t\in\mathcal{T}}$, that is, a sequence
of random variables indexed by the $t$'s in $\mathcal{T}$ (where, here, $\mathcal{T}$ is equal to $\mathbb{N}$ or $\mathbb{Z}$)
is a Markov chain when the distribution of $x_t$ conditional on the past values
(for instance, $x_{0:(t-1)} = (x_0,\ldots,x_{t-1})$ when $\mathcal{T} = \mathbb{N}$) only depends on $x_{t-1}$.
This process is homogeneous if the distribution of $x_t$ conditional on the past


¹ The four stocks are as follows. ABN Amro is an international bank from
Holland. Aegon is a Dutch insurance company. Ahold Kon., namely Koninklijke
Ahold N.V., is also a Dutch company, dealing in retail and food-service businesses.
Air Liquide is a French company specializing in industrial and medical gases.

² At the present time, the euro zone is made up of the following countries:
Austria, Belgium, Finland, France, Germany, Greece, Holland, Ireland, Italy,
Portugal, and Spain.

[Figure 7.1: four panels, ABN AMRO HOLDING, AEGON, AHOLD KON., and AIR LIQUIDE, each plotted against time t]

Fig. 7.1. Dataset Eurostoxx50: Evolution of the first four stocks over the period
January 1, 1998 to November 9, 2003


is constant in $t\in\mathcal{T}$. Thus, given an observed sequence $x_{0:T} = (x_0,\ldots,x_T)$
from a homogeneous Markov chain, the associated likelihood is given by

$$\ell(\theta|x_{0:T}) = f_0(x_0|\theta) \prod_{t=1}^{T} f(x_t|x_{t-1},\theta)\,,$$

where $f_0$ is the distribution of the starting value $x_0$. From a Bayesian point of
view, this likelihood can be processed almost as in an iid model once a prior
distribution on $\theta$ is chosen.
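
The factorization above translates directly into a sum of log conditional densities. As a minimal sketch, the function below evaluates this log-likelihood for one particular homogeneous Markov chain, a Gaussian first-order kernel started from its stationary distribution; the kernel, the function name `markov_loglik`, and the parameter names are assumptions chosen for illustration.

```r
# Gaussian first-order kernel (assumed for illustration):
#   x_t | x_{t-1} ~ N(mu + rho * (x_{t-1} - mu), sigma^2), with |rho| < 1,
#   x_0 ~ N(mu, sigma^2 / (1 - rho^2))
markov_loglik <- function(x, mu, rho, sigma) {
  n <- length(x)
  # log f_0(x_0 | theta): contribution of the starting value
  log_f0 <- dnorm(x[1], mu, sigma / sqrt(1 - rho^2), log = TRUE)
  # sum over t of log f(x_t | x_{t-1}, theta)
  log_trans <- dnorm(x[-1], mu + rho * (x[-n] - mu), sigma, log = TRUE)
  log_f0 + sum(log_trans)
}

set.seed(4)
x <- as.numeric(arima.sim(list(ar = 0.7), n = 200))
markov_loglik(x, mu = 0, rho = 0.7, sigma = 1)

# With rho = 0 the chain is iid and the likelihood reduces to the iid one
iid_check <- markov_loglik(x, mu = 0.3, rho = 0, sigma = 2)
iid_ref <- sum(dnorm(x, 0.3, 2, log = TRUE))
c(iid_check, iid_ref)
```

Combined with a prior on the parameters, such a function is all that a generic sampler needs to explore the posterior.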

However, a generic time series may be represented in formally the same
way, namely through the full conditionals as in

$$\ell(\theta|x_{0:T}) = f_0(x_0|\theta) \prod_{t=1}^{T} f_t(x_t|x_{0:(t-1)},\theta)\,. \qquad (7.1)$$

When this function can be obtained in a closed form, a Bayesian analysis is
equally possible.

Note that general time-series models can often be represented as Markov
models via the inclusion of missing variables and an increase in the dimension
of the model. This is called a state-space representation.

7.1.2  Stability  of  Time  Series 

While we pointed out above that, once the likelihood function is written down,
the Bayesian processing of the model is the same as in the iid case,³ there
exists a major difference that leads to a more delicate determination of the
corresponding prior distributions, in that the new properties of stationarity and
causality constraints must often be accounted for. We cannot embark here on
a mathematically rigorous coverage of stationarity for stochastic processes or
even for time series (see Brockwell and Davis, 1996), and thus simply mention
(and motivate) below the constraints found in the time series literature.

A stochastic process $(x_t)_{t\in\mathcal{T}}$ is stationary⁴ if the joint distributions of
$(x_1,\ldots,x_k)$ and $(x_{1+h},\ldots,x_{k+h})$ are the same for all indices $k$ and $k+h$
in $\mathcal{T}$. Formally, this property is called strict stationarity because there exists
an alternative version of stationarity, called second-order stationarity. This
alternative imposes invariance in time only on first and second moments of
the process. If we define the autocovariance function $\gamma_x(\cdot,\cdot)$ of the process
$(x_t)_{t\in\mathcal{T}}$ by

$$\gamma_x(r,s) = \mathbb{E}[\{x_r - \mathbb{E}(x_r)\}\{x_s - \mathbb{E}(x_s)\}]\,,\qquad r,s\in\mathcal{T}\,,$$

namely the covariance between $x_r$ and $x_s$, $\text{cov}(x_r,x_s)$, assuming that the
variance $\mathbb{V}(x_t)$ is finite, a process $(x_t)_{t\in\mathcal{T}}$ with finite second moments is
second-order stationary if

$$\mathbb{E}(x_t) = \mu \quad\text{and}\quad \gamma_x(r,s) = \gamma_x(r+t,s+t)$$

for all $r,s,t\in\mathcal{T}$.

If $(x_t)_{t\in\mathcal{T}}$ is second-order stationary, then $\gamma_x(r,s) = \gamma_x(|r-s|,0)$ for all
$r,s\in\mathcal{T}$. It is therefore convenient to redefine the autocovariance function of
a second-order stationary process as a function of just one variable; i.e., with
a slight abuse of notation,

$$\gamma_x(h) = \gamma_x(h,0)\,,\qquad h\in\mathcal{T}\,.$$


³ In the sense that, once a closed form of the posterior is available as in (7.1),
there exist generic simulation techniques that do not take into account the dynamic
structure of the model.

⁴ The connection with the stationarity requirement of MCMC methods is that
these methods produce a Markov kernel such that, when the Markov chain is started
at time $t = 0$ from the target distribution $\pi$, the whole sequence $(x_t)_{t\in\mathbb{N}}$ is stationary
with marginal distribution $\pi$.

The function $\gamma_x(\cdot)$ is called the autocovariance function of $(x_t)_{t\in\mathcal{T}}$, and $\gamma_x(h)$
is said to be the autocovariance "at lag" $h$.

The autocorrelation function is implemented in R as acf(), already used
in Chaps. 4 and 5 for computing the effective sample size of an MCMC sample.
By default, the function acf() returns $10\log_{10}(m)$ autocorrelations when
applied to a series (vector) of size $m$, the autocovariances being obtained
with the option type="covariance", and it also produces a graph of those
autocorrelations unless the option plot=FALSE is activated.
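
To make the link between the definition of $\gamma_x(h)$ and the output of acf() concrete, the following sketch compares a hand-coded empirical autocovariance with the values returned by acf() on a simulated series. The helper `emp_autocov` and the simulated data are assumptions of this illustration; the divisor $n$ (rather than $n-h$) matches the convention used by acf().

```r
set.seed(7)
x <- as.numeric(arima.sim(list(ar = 0.6), n = 200))
n <- length(x)

# Empirical autocovariance at lag h, centered at the overall mean, divisor n
emp_autocov <- function(x, h) {
  n <- length(x)
  xbar <- mean(x)
  sum((x[(1 + h):n] - xbar) * (x[1:(n - h)] - xbar)) / n
}

gam_manual <- vapply(0:5, function(h) emp_autocov(x, h), numeric(1))
gam_acf <- acf(x, lag.max = 5, type = "covariance", plot = FALSE)$acf[, 1, 1]
rbind(gam_manual, gam_acf)

# Number of values returned by default: lags 0 to floor(10 * log10(n))
n_default <- length(acf(x, plot = FALSE)$acf)
n_default
```

Dividing the autocovariances by the lag-0 value gives the autocorrelations displayed by the default call.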

An illustration of acf() for the ABN Amro stock series is given by

> data(Eurostoxx50)
> abnamro=Eurostoxx50[,2]
> abnamro=ts(abnamro,freq=365-55*2,start=1998)
> par(mfrow=c(2,2),mar=c(4,4,1,1))
> plot.ts(abnamro,col="steelblue")
> acf(abnamro,lag=365-55*2)
> plot.ts(diff(abnamro),col="steelblue")
> acf(diff(abnamro))

whose graphical output is given in Fig. 7.2. The ts function turns the vector
of ABN Amro stocks into a time series, which explains the years on the
first axis in plot.ts and the relative values on the first axis in acf, where 1.0
corresponds to a whole year. (The range of a year is computed by adding six
bank holidays per year to the weekend breaks.) The second row corresponds
to the time-series representation of the first difference $(x_{t+1} - x_t)$, a standard
approach used to remove the clear lack of stationarity of the original series.
The difference in the autocorrelation graphs is striking: in particular, the
complete lack of significant autocorrelation in the first difference is indicative
of a random walk behavior for the original series.
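
The same contrast can be reproduced without the dataset. The sketch below simulates a Gaussian random walk (an assumption standing in for the stock series) and compares the lag-1 autocorrelation of the levels with that of the first difference.

```r
set.seed(5)
n <- 2000
walk <- cumsum(rnorm(n))      # random walk: x_t = x_{t-1} + eps_t

# Lag-1 autocorrelations (element 1 of $acf is lag 0, element 2 is lag 1)
rho_level <- acf(walk, plot = FALSE)$acf[2]
rho_diff <- acf(diff(walk), plot = FALSE)$acf[2]
c(level = rho_level, difference = rho_diff)
```

The levels are almost perfectly correlated at lag 1 while the differences, which are the white noise itself, show essentially no autocorrelation.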


Obviously, strict stationarity is stronger than second-order stationarity,
and this feature somehow seems more logical from a Bayesian viewpoint as it
is a property of the whole model.⁵ For a process $(x_t)_{t\in\mathbb{N}}$, this property relates
to the distribution $f_0$ of the starting values.

From a Bayesian point of view, to impose the stationarity condition on a
model (or rather on its parameters) is however objectionable on the grounds
that the data themselves should indicate whether or not the underlying model
is stationary. In addition, since the datasets we consider are always finite, the
stationarity requirement is at best artificial in practice. For instance, the series
in Fig. 7.1 are clearly not stationary on the temporal scale against which
they are plotted. However, for reasons ranging from asymptotics (Bayes estimators
are not necessarily convergent in nonstationary settings) to causality,

⁵ Nonetheless, there exists a huge amount of literature on the study of time series
based only on second-moment assumptions.

Fig. 7.2. Representations (left) of the time series ABN Amro (top) and of its first
difference (bottom), along with the corresponding acf graphs (right)


to identifiability (see below), and to common practice, it is customary to impose
stationarity constraints, possibly on transformed data, even though a
Bayesian inference on a nonstationary process could be conducted in principle.
The practical difficulty is that, for complex models, the stationarity
constraints may get quite involved and may even be unknown in some cases,
as for some threshold or changepoint models. We will present (and solve) this
difficulty in the following sections.


7.2  Autoregressive  (AR)  Models 

In this section, we consider one of the most common (linear) time series models,
the AR(p) model, along with its Bayesian analyses and its Markov connections
(which can be exploited in some MCMC implementations).

7.2.1 The Models

An AR(1) process $(x_t)_{t\in\mathbb{Z}}$ (where AR stands for autoregressive) is defined by
the conditional relation ($t\in\mathbb{Z}$),

$$x_t = \mu + \varrho(x_{t-1} - \mu) + \epsilon_t\,, \qquad (7.2)$$

where $(\epsilon_t)_{t\in\mathbb{Z}}$ is an iid sequence of random variables with mean 0 and variance
$\sigma^2$ (that is, a so-called white noise). Unless otherwise specified, we will only
consider the $\epsilon_t$'s to be iid $\mathcal{N}(0,\sigma^2)$ variables.⁶

If $|\varrho| < 1$, $(x_t)_{t\in\mathbb{Z}}$ can be written as

$$x_t = \mu + \sum_{j=0}^{\infty} \varrho^j \epsilon_{t-j}\,, \qquad (7.3)$$

and it is easy to see that this is a unique second-order stationary representation.
More surprisingly, if $|\varrho| > 1$, the unique second-order stationary
representation of (7.2) is

$$x_t = \mu - \sum_{j=1}^{\infty} \varrho^{-j} \epsilon_{t+j}\,.$$

This stationary solution is frequently criticized as being artificial because it
implies that $x_t$ is correlated with the future white noises $(\epsilon_s)_{s>t}$, a property not
shared by (7.3) when $|\varrho| < 1$. While mathematically correct, the fact that $x_t$
appears as a weighted sum of random variables that are generated after time
$t$ is indeed quite peculiar, and it is thus customary to restrict the definition of
AR(1) processes to the case $|\varrho| < 1$ so that $x_t$ has a representation in terms of
the past realizations $(\epsilon_s)_{s\le t}$. Formally, this restriction corresponds to so-called
causal or future-independent autoregressive processes.⁷ Notice that the causal
constraint for the AR(1) model can be naturally associated with a uniform
prior on $(-1,1)$.

Note that, when we replace the above normal sequence $(\epsilon_t)$ with another
white noise sequence, it is possible to express an AR(1) process with $|\varrho| > 1$
as an AR(1) process with $|\varrho| < 1$. However, this modification is not helpful
from a Bayesian point of view because of the complex distribution of the
transformed white noise.
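
The causal representation (7.3) can be checked numerically against the recursion (7.2). The sketch below takes $\mu = 0$ and starts the recursion at zero, so that the infinite sum is truncated at the first simulated noise; these choices, the parameter values, and the variable names are assumptions of the illustration.

```r
set.seed(6)
rho <- 0.8
sigma <- 1
n <- 300
eps <- rnorm(n, sd = sigma)

# Recursion (7.2) with mu = 0: x_t = rho * x_{t-1} + eps_t, started from 0
x_rec <- as.numeric(stats::filter(eps, rho, method = "recursive"))

# Truncated sum (7.3): x_t = sum_{j=0}^{t-1} rho^j * eps_{t-j}
x_sum <- vapply(1:n, function(t) sum(rho^(0:(t - 1)) * eps[t:1]), numeric(1))

# Both constructions coincide up to rounding error
max_gap <- max(abs(x_rec - x_sum))
max_gap

# Stationary variance implied by (7.3): sigma^2 * sum_j rho^(2j) = sigma^2 / (1 - rho^2)
var_series <- sigma^2 * sum(rho^(2 * (0:500)))
var_closed <- sigma^2 / (1 - rho^2)
c(var_series, var_closed)
```

The geometric decay of the weights $\varrho^j$ is what makes the sum converge when $|\varrho| < 1$, and what fails at $|\varrho| = 1$.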


⁶ Once again, there exists a statistical approach that leaves the distribution of the
$\epsilon_t$'s unspecified and only works with first and second moments. But this perspective
is clearly inappropriate within the Bayesian framework, which cannot really work
with half-specified models.

⁷ Both stationary solutions above exclude the case $|\varrho| = 1$. This is because the
process (7.2) is then a random walk with no stationary solution.

A natural generalization of the AR(1) model is obtained by increasing the
lag dependence on the past values. An AR(p) process is thus defined by the
conditional (against the past) representation ($t\in\mathbb{Z}$),

$$x_t = \mu + \sum_{i=1}^{p} \varrho_i(x_{t-i} - \mu) + \epsilon_t\,, \qquad (7.4)$$

where $(\epsilon_t)_{t\in\mathbb{Z}}$ is a white noise. As above, we will assume implicitly that the
white noise is normally distributed. This natural generalization assumes that
the $p$ most recent values of the process influence (linearly) the current value
of the process. As for the AR(1) model, stationarity and causality constraints
can be imposed on this model.

A lack of stationarity of a time series theoretically implies that the series
ultimately diverges to $\pm\infty$. An illustration of this property is provided by the
following R code, which produces four AR(10) series of 260 points based on
the same $\epsilon_t$'s when the coefficients $\varrho_i$ are uniform over $(-.5,.5)$. The first and
the last series either have coefficients that satisfy the stationarity conditions
or have not yet exhibited a divergent trend. Both remaining series clearly
exhibit divergence.

> p=10
> T=260
> dat=seqz=rnorm(T)
> par(mfrow=c(2,2),mar=c(2,2,1,1))
> for (i in 1:4){
+ coef=runif(p,min=-.5,max=.5)
+ for (t in ((p+1):T))
+ seqz[t]=sum(coef*seqz[(t-p):(t-1)])+dat[t]
+ plot(seqz,ty="l",col="sienna",lwd=2,ylab="")
+ }

As shown in Brockwell and Davis (1996, Theorem 3.1.1), the AR(p) process
(7.4) is both causal and second-order stationary if and only if the roots of the
polynomial

$$\mathcal{P}(u) = 1 - \sum_{i=1}^{p} \varrho_i u^i \qquad (7.5)$$

are all outside the unit circle in the complex plane. (Remember that polynomials
of degree $p$ always have $p$ roots, but that some of those roots may
be complex numbers.) While this necessary and sufficient condition on the
parameters $\varrho_i$ is clearly defined, it also imposes an implicit constraint on the
vector $\varrho = (\varrho_1,\ldots,\varrho_p)$. Indeed, in order to verify that a given vector $\varrho$ satisfies
this condition, one needs first to find the roots of the $p$th degree polynomial
$\mathcal{P}$ and then to check that these roots all are of modulus larger than 1. In other
words, there is no clearly defined boundary on the parameter space to define


7.2 


Autoregressive  (AR)  Models  217 


Fig.  7.3.  Four  simulation  of  an  AR(10)  series  of  260  points  when  based  on  the 
same  standard  normal  perturbations  et  and  when  the  coefficients  Qi  are  uniform 
over  (  —  .5,  .5) 


the  gs  that  satisfy  (or  do  not  satisfy)  this  constraint,  and  this  creates  a  major 
difficulty  for  simulation  applications,  given  that  simulated  values  of  g  need  to 
be  tested  one  at  a  time.  For  instance,  the  R  code 

>  maxi=0 

>  for  (i  in  (1:10~6))  maxi=maxi+ 

+  (max (Mod (polyroot (c (1 ,runif (10 , - . 5 , .5)))))>1) 

>  maxi/10~6 

[1]  1 

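shows only that, for each of the one million simulated coefficient vectors of the AR(10) model, at least one root of the polynomial lies outside the unit circle, since the test bears on the largest modulus. The constraint requires instead that all roots be outside the unit circle, that is, that the smallest modulus exceed 1, so this output does not by itself identify which coefficient vectors violate the constraint. As noted above, the first and last series in Fig. 7.3 may thus still correspond to stationary coefficients, while the two divergent ones do not.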
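As a hedged illustration of the root condition attached to (7.5), the following sketch tests a coefficient vector through the smallest root modulus returned by polyroot(). The helper name is_stationary, the seed, and the reduced number of simulations are assumptions made for this example only and are not taken from the text.

```r
# Hypothetical helper: TRUE when all roots of P(u) = 1 - sum_i rho_i u^i
# lie strictly outside the unit circle (causal stationarity condition).
is_stationary <- function(rho) {
  roots <- polyroot(c(1, -rho))    # coefficients by increasing degree
  min(Mod(roots)) > 1
}

set.seed(1)                        # arbitrary seed, for reproducibility only
nsim <- 10^4                       # fewer draws than 10^6 to keep the run short
prop <- mean(replicate(nsim, is_stationary(runif(10, -.5, .5))))
prop                               # estimated proportion of stationary vectors
```

Calling is_stationary() on each simulated vector is exactly the one-at-a-time testing mentioned above; the estimated proportion depends on the seed and on nsim.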

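Note that the general AR(p) model is Markov, just like the AR(1) model, because the distribution of $x_{t+1}$ only depends on a fixed number of past values. It can thus be expressed as a regular Markov chain when considering the vector, for $t \ge p - 1$,

$$z_t = (x_t, x_{t-1}, \ldots, x_{t+1-p})^{\mathsf{T}} .$$

Indeed, we can write

$$z_{t+1} = \mu 1_p + B (z_t - \mu 1_p) + \boldsymbol{\epsilon}_{t+1} ,$$

where

$$1_p = (1, \ldots, 1)^{\mathsf{T}} \in \mathbb{R}^p , \qquad
B = \begin{pmatrix}
\varrho_1 & \varrho_2 & \varrho_3 & \cdots & \varrho_{p-2} & \varrho_{p-1} & \varrho_p \\
1 & 0 & 0 & \cdots & 0 & 0 & 0 \\
0 & 1 & 0 & \cdots & 0 & 0 & 0 \\
\vdots & & \ddots & & & & \vdots \\
0 & 0 & 0 & \cdots & 0 & 1 & 0
\end{pmatrix} ,$$

and $\boldsymbol{\epsilon}_t = (\epsilon_t, 0, \ldots, 0)^{\mathsf{T}}$.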
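The Markov representation above can be checked numerically. In the following sketch, the helper companion() and all numerical values are illustrative assumptions: it builds the matrix B, verifies that one noise-free transition of $z_t$ reproduces the recursion (7.4), and compares the moduli of the eigenvalues of B with the inverse root moduli of the polynomial (7.5).

```r
# Hypothetical helper building the p x p matrix B of the Markov representation
companion <- function(rho) {
  p <- length(rho)
  B <- matrix(0, p, p)
  B[1, ] <- rho                              # first row holds the coefficients
  if (p > 1) B[cbind(2:p, 1:(p - 1))] <- 1   # ones on the subdiagonal
  B
}

rho <- c(0.5, -0.3, 0.1)                 # arbitrary illustrative coefficients
mu <- 2
B <- companion(rho)

# One transition with the noise switched off, to compare both formulations
z <- c(1.0, 0.4, -0.7)                   # (x_t, x_{t-1}, x_{t-2})
znew <- mu + B %*% (z - mu)
direct <- mu + sum(rho * (z - mu))       # right-hand side of (7.4) without noise

# The eigenvalues of B are the inverse roots of the polynomial (7.5)
eigmod <- sort(Mod(eigen(B)$values))
invmod <- sort(1 / Mod(polyroot(c(1, -rho))))
```

The first component of znew matches direct, while the remaining components are simply the shifted past values; stationarity can therefore equivalently be read from the eigenvalues of B all having modulus smaller than 1.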

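If we now consider the likelihood associated with a series $x_{0:T}$ of observations from a Gaussian AR(p) process, it depends on the unobserved values $x_{-p}, \ldots, x_{-1}$ since

$$\ell(\mu, \varrho_1, \ldots, \varrho_p, \sigma \mid x_{0:T}, x_{-p:-1}) \propto \sigma^{-T-1} \prod_{t=0}^{T} \exp\left\{ -\Big( x_t - \mu - \sum_{i=1}^{p} \varrho_i (x_{t-i} - \mu) \Big)^2 \Big/ 2\sigma^2 \right\} .$$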

These unobserved initial values can be processed in various ways that we now describe. First, they can all be set equal to $\mu$, but this is a purely computational convenience with no justification. Second, if the stationarity and causality constraints hold, the process $(x_t)_{t \in \mathbb{Z}}$ has a stationary distribution and one can assume that $x_{-p:-1}$ is distributed from the corresponding stationary distribution, namely a $\mathcal{N}_p(\mu 1_p, A)$ distribution. We can then integrate those initial values out to obtain the marginal likelihood

$$\int \sigma^{-T-1} \prod_{t=0}^{T} \exp\left\{ \frac{-1}{2\sigma^2} \Big( x_t - \mu - \sum_{i=1}^{p} \varrho_i (x_{t-i} - \mu) \Big)^2 \right\} f(x_{-p:-1} \mid \mu, A) \, \mathrm{d}x_{-p:-1} ,$$

based on the argument that they are not directly observed. This likelihood can be dealt with analytically but is more easily processed via a Gibbs sampler that simulates the initial values. An alternative and equally coherent approach is to consider instead the likelihood conditional on the initial observed values $x_{0:(p-1)}$; that is,

$$\ell^c(\mu, \varrho_1, \ldots, \varrho_p, \sigma \mid x_{p:T}, x_{0:(p-1)}) \propto \sigma^{-T+p-1} \prod_{t=p}^{T} \exp\left\{ -\Big( x_t - \mu - \sum_{i=1}^{p} \varrho_i (x_{t-i} - \mu) \Big)^2 \Big/ 2\sigma^2 \right\} . \qquad (7.7)$$

Unless specified otherwise, we will adopt this approach. In this case, if we do not restrict the parameter space through stationarity conditions, a natural conjugate prior can be found for the parameter $\theta = (\mu, \varrho, \sigma^2)$, made up of a normal distribution on $(\mu, \varrho)$ and an inverse gamma distribution on $\sigma^2$. Instead of the Jeffreys prior, which is controversial in this setting (see Robert, 2007, Note 4.7.2), we can also propose a more traditional noninformative prior such as $\pi(\theta) = 1/\sigma^2$.
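To make the conditional likelihood (7.7) concrete, here is a minimal sketch that evaluates its logarithm directly from a coefficient vector. The function name ar_cond_loglik, the use of embed() to build the lagged values, and the simulated data are assumptions of this example rather than the code used elsewhere in the chapter.

```r
# Hypothetical helper: conditional log-likelihood (7.7), up to an additive
# constant, for x = (x_0, ..., x_T) stored in an R vector of length T + 1.
ar_cond_loglik <- function(x, mu, rho, sig2) {
  p <- length(rho)
  n <- length(x)
  # each row of embed() holds (x_t, x_{t-1}, ..., x_{t-p}) - mu, t = p, ..., T
  lagged <- embed(x - mu, p + 1)
  resid <- lagged[, 1] - lagged[, -1, drop = FALSE] %*% rho
  -(n - p) * log(sig2) / 2 - sum(resid^2) / (2 * sig2)
}

set.seed(2)                                # arbitrary seed
x <- as.numeric(arima.sim(list(ar = c(0.6, -0.2)), n = 200)) + 1
ar_cond_loglik(x, mu = 1, rho = c(0.6, -0.2), sig2 = 1)
```

Since the first p observations only enter through the lagged terms, the sum runs over n - p residuals, which matches the power $-T+p-1$ of $\sigma$ in (7.7).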


7.2.2  Exploring  the  Parameter  Space  by  MCMC  Algorithms 

If we do impose the causal stationarity constraint on $\varrho$ that all the roots of $\mathcal{P}$ in (7.5) be outside the unit circle, the set of acceptable $\varrho$'s becomes quite involved and we cannot, for instance, use as prior distribution a normal distribution restricted to this set, if only because we lack a simple algorithm to properly describe the set. While a feasible solution is based on the partial autocorrelations of the AR(p) process (see Robert, 2007, Sect. 4.5.2), we cover here a different and somewhat simpler reparameterization approach using the inverses of the real and complex roots of the polynomial $\mathcal{P}$, which are within the unit interval $(-1, 1)$ and the unit sphere, respectively. Because of this unusual structure of the parameter space, involving two subsets of completely different natures, we introduce an MCMC algorithm that could be related to birth and death processes and simulation in variable dimension spaces.

If we represent the polynomial (7.5) in its factorized form

$$\mathcal{P}(x) = \prod_{i=1}^{p} (1 - \lambda_i x) ,$$

the inverse roots, $\lambda_i$ ($i = 1, \ldots, p$), are either real numbers or complex conjugates.⁸ Under the causal stationarity constraint, a natural prior is then to use uniform priors for these roots, taking a uniform distribution on the number $r_p$ of conjugate complex roots and uniform distributions on $[-1, 1]$ and on the unit sphere $\mathbb{S} = \{\lambda \in \mathbb{C};\ |\lambda| < 1\}$ for the real and nonconjugate complex roots, respectively. In other words,

$$\pi(\lambda) = \frac{1}{\lfloor p/2 \rfloor + 1} \prod_{\lambda_i \in \mathbb{R}} \frac{1}{2}\, \mathbb{I}_{|\lambda_i| < 1} \prod_{\lambda_i \notin \mathbb{R}} \frac{1}{\pi}\, \mathbb{I}_{|\lambda_i| < 1} , \qquad (7.8)$$


where $\lfloor p/2 \rfloor + 1$ is the number of different values of $r_p$ and the second product is restricted to the nonconjugate roots of $\mathcal{P}$. (Note that the quantity $\pi$ in the denominator is in fact the surface of the unit sphere of $\mathbb{C}$.)

Note that this $\lfloor p/2 \rfloor + 1$ factor, while unimportant for a fixed p setting, must necessarily be included within the posterior distribution when using a birth-and-death MCMC algorithm to estimate the lag order p since it does not vanish in the acceptance probability of a move between an AR(p) model and an AR(q) model.

⁸ The term conjugate is to be understood here in the complex calculus sense that, if $\iota^2 = -1$ defines the standard root of $-1$ and $\lambda = r e^{\iota\theta}$ is a (complex) root of $\mathcal{P}$, then $\bar\lambda = r e^{-\iota\theta}$ is also a (complex) root of $\mathcal{P}$.

While the connection between the inverse roots and the coefficients of the polynomial $\mathcal{P}$ is straightforward (Exercise 7.10), there is no closed-form expression of the posterior distribution either on the roots or on the coefficients. Therefore, a numerical approach is once again compulsory to approximate the posterior distribution. However, any Metropolis-Hastings scheme can work here, given that the likelihood function can be easily computed at every point.

The derivation of the coefficients $\varrho_i$ of the autoregressive model from the roots follows from a recursive linear procedure explained in Exercise 7.10. If pr and pc denote the number of real and complex roots and if lr and lc are the real and (non-conjugate) complex roots, respectively, the former being a vector and the latter a two-column matrix, we have

Psi=matrix(0,ncol=p,nrow=p+1)
Psi[1,]=1
if (pr>0){
  # real roots: multiply by one factor (1-lr[i]*x) at a time
  Psi[2,1]=-lr[1]
  if (pr>1){
    for (i in 2:pr)
      Psi[2:(i+1),i]=Psi[2:(i+1),i-1]-lr[i]*Psi[1:i,i-1]
  }
}
if (pc>0){
  # complex roots: multiply by one conjugate pair at a time
  if (pr>0){
    Psi[2,pr+2]=-2*lc[1]+Psi[2,pr]
    Psi[3:(pr+3),pr+2]=(lc[1]^2+lc[2]^2)*Psi[1:(pr+1),pr]-
      2*lc[1]*Psi[2:(pr+2),pr]+Psi[3:(pr+3),pr]
  }else{
    Psi[2,2]=-2*lc[1];
    Psi[3,2]=(lc[1]^2+lc[2]^2);
  }
  if (pc>2){
    for (i in seq(4,pc,2)){
      pri=pr+i
      prim=pri-2
      Psi[2,pri]=-2*lc[i-1]+Psi[2,prim]
      Psi[3:(pri+1),pri]=(lc[i-1]^2+lc[i]^2)*Psi[1:(pri-1),prim]-
        2*lc[i-1]*Psi[2:pri,prim]+Psi[3:(pri+1),prim]
    }
  }
}
Rho=Psi[1:(p+1),p]

where the $\varrho_i$'s are the opposites of the components of Rho[2:(p+1)]. The log-likelihood (7.7) is then derived in a straightforward manner:

x=x-mu
loglike=0
for (i in (p+1):T)
  loglike=loglike-(t(Rho)%*%x[i:(i-p)])^2
loglike=(loglike/sig2-(T-p)*log(sig2))/2
x=x+mu
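The recursive construction of the coefficients from the inverse roots amounts to multiplying the factors $(1 - \lambda_i x)$ one at a time. The sketch below is a compact alternative written with complex arithmetic; the function name roots_to_poly and the chosen roots are assumptions of this example, and it is not meant to replace the Psi-based code above.

```r
# Hypothetical helper: coefficients (1, -rho_1, ..., -rho_p) of
# prod_i (1 - lambda_i x), obtained by multiplying one factor at a time.
roots_to_poly <- function(lambda) {
  coefs <- 1
  for (l in lambda) {
    # multiply the current polynomial by (1 - l x)
    coefs <- c(coefs, 0) - l * c(0, coefs)
  }
  Re(coefs)   # imaginary parts cancel when complex roots come in conjugate pairs
}

lambda <- c(0.5, -0.8, complex(real = 0.3, imaginary = 0.4),
            complex(real = 0.3, imaginary = -0.4))
Rho <- roots_to_poly(lambda)
rho <- -Rho[-1]                      # AR coefficients rho_1, ..., rho_p

# Cross-check: polyroot() recovers the roots 1/lambda_i of the polynomial
recovered <- sort(Mod(1 / polyroot(Rho)))
original <- sort(Mod(lambda))
```

Here Rho plays the same role as the vector Rho in the text, so that the AR coefficients are the opposites of its components beyond the first one.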

Since the conditional likelihood function (7.7) is a standard Gaussian likelihood in both $\mu$ and $\sigma$, we can directly use a Gibbs sampler on those parameters and opt for a Metropolis-within-Gibbs step on the remaining (inverse) roots of $\mathcal{P}$, $\lambda = (\lambda_1, \ldots, \lambda_p)$. A potentially inefficient⁹ if straightforward Metropolis-Hastings implementation is to use the prior distribution $\pi(\lambda)$ itself as a proposal on $\lambda$. This means selecting first one or several roots of $\mathcal{P}$, $\lambda_1, \ldots, \lambda_q$ ($1 \le q \le p$), and then proposing new values for these roots that are simulated from the prior, $\lambda'_1, \ldots, \lambda'_q \sim \pi(\lambda)$. (Reordering the roots so that the modified values are the first ones is not restrictive since both the prior and the likelihood are permutation invariant.) The acceptance ratio then simplifies into the likelihood ratio by virtue of Bayes' theorem:

$$\frac{\ell^c(\mu, \lambda', \sigma \mid x_{p:T}, x_{0:(p-1)})\, \pi(\mu, \lambda', \sigma)}{\ell^c(\mu, \lambda, \sigma \mid x_{p:T}, x_{0:(p-1)})\, \pi(\mu, \lambda, \sigma)} \, \frac{\pi(\lambda)}{\pi(\lambda')} = \frac{\ell^c(\mu, \lambda', \sigma \mid x_{p:T}, x_{0:(p-1)})}{\ell^c(\mu, \lambda, \sigma \mid x_{p:T}, x_{0:(p-1)})} .$$

The main difficulty with this scheme is that one must take care to modify complex roots by (conjugate) pairs. This means, for instance, that to create a complex root (and its conjugate) either another complex root (and its conjugate) or two real roots must be chosen and modified. Formally, this is automatically satisfied by simulations from the prior (7.8).
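A draw from the prior (7.8) can be sketched as follows. The function name rprior_roots and the representation of the roots as an R complex vector are assumptions of this illustration (the code discussed in this section stores real and complex roots separately); the polar simulation of the complex roots uses the usual square-root transform for a uniform draw over a disk.

```r
# Hypothetical helper drawing p inverse roots from the uniform prior (7.8):
# the number of conjugate pairs is uniform on 0, ..., floor(p/2), real roots
# are uniform on [-1, 1] and complex roots are uniform over the unit disk.
rprior_roots <- function(p) {
  npairs <- sample.int(p %/% 2 + 1, 1) - 1
  nreal <- p - 2 * npairs
  lr <- 2 * runif(nreal) - 1
  theta <- 2 * pi * runif(npairs)
  rad <- sqrt(runif(npairs))           # sqrt makes the draw uniform in area
  lc <- complex(modulus = rad, argument = theta)
  c(as.complex(lr), lc, Conj(lc))      # each complex root comes with its conjugate
}

set.seed(3)                            # arbitrary seed
lambda <- rprior_roots(5)
Mod(lambda)                            # all moduli are at most 1
```

Because every complex draw is returned together with its conjugate, the resulting polynomial has real coefficients, which is the pairing requirement discussed above.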

One  possible  algorithmic  representation  is  therefore: 


⁹ Simulating from the prior distribution when aiming at the posterior distribution inevitably leads to a waste of simulations if the data is informative about the parameters. The solution is of course unavailable when the prior is improper.

Algorithm 7.13 Metropolis-Hastings AR(p) Sampler

Initialization: Choose $\lambda^{(0)}$, $\mu^{(0)}$, and $\sigma^{(0)}$.

Iteration t ($t \ge 1$):

1. Select one root at random.
   If the root is real, generate a new real root from the prior distribution.
   Otherwise, generate a new complex root from the prior distribution and update the conjugate root.
   Replace $\lambda^{(t-1)}$ with $\lambda^*$ using these new values.
   Calculate the corresponding $\varrho^* = (\varrho^*_1, \ldots, \varrho^*_p)$.
   Take $\xi = \lambda^*$ with probability

   $$\frac{\ell^c(\mu^{(t-1)}, \varrho^*, \sigma^{(t-1)} \mid x_{p:T}, x_{0:(p-1)})}{\ell^c(\mu^{(t-1)}, \varrho^{(t-1)}, \sigma^{(t-1)} \mid x_{p:T}, x_{0:(p-1)})} \wedge 1 ,$$

   and $\xi = \lambda^{(t-1)}$ otherwise.

2. Select two real roots or two complex conjugate roots at random.
   If the roots are real, generate a new complex root from the prior distribution and compute the conjugate root.
   Otherwise, generate two new real roots from the prior distribution.
   Replace $\xi$ with $\lambda^*$ using these new values.
   Calculate the corresponding $\varrho^* = (\varrho^*_1, \ldots, \varrho^*_p)$.
   Accept $\lambda^{(t)} = \lambda^*$ with probability

   $$\frac{\ell^c(\mu^{(t-1)}, \varrho^*, \sigma^{(t-1)} \mid x_{p:T}, x_{0:(p-1)})}{\ell^c(\mu^{(t-1)}, \varrho^{(t-1)}, \sigma^{(t-1)} \mid x_{p:T}, x_{0:(p-1)})} \wedge 1 ,$$

   and set $\lambda^{(t)} = \xi$ otherwise.

3. Generate $\mu^*$ by a random walk proposal.
   Accept $\mu^{(t)} = \mu^*$ with probability

   $$\frac{\ell^c(\mu^*, \varrho^{(t)}, \sigma^{(t-1)} \mid x_{p:T}, x_{0:(p-1)})}{\ell^c(\mu^{(t-1)}, \varrho^{(t)}, \sigma^{(t-1)} \mid x_{p:T}, x_{0:(p-1)})} \wedge 1 ,$$

   and set $\mu^{(t)} = \mu^{(t-1)}$ otherwise.

4. Generate $\sigma^*$ by a log-random walk proposal.
   Accept $\sigma^{(t)} = \sigma^*$ with probability

   $$\frac{\ell^c(\mu^{(t)}, \varrho^{(t)}, \sigma^* \mid x_{p:T}, x_{0:(p-1)})}{\ell^c(\mu^{(t)}, \varrho^{(t)}, \sigma^{(t-1)} \mid x_{p:T}, x_{0:(p-1)})} \wedge 1 ,$$

   and set $\sigma^{(t)} = \sigma^{(t-1)}$ otherwise.
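Step 1 of the algorithm can be sketched in a deliberately simplified setting. The code below is an illustration under strong assumptions and not the 300-line implementation mentioned in the text: all roots are kept real, $\mu$ and $\sigma^2$ are held fixed, the data are simulated, and the helper names roots_to_rho and cond_loglik are hypothetical.

```r
# Hypothetical helpers (not the book's code)
roots_to_rho <- function(lambda) {
  coefs <- 1
  for (l in lambda) coefs <- c(coefs, 0) - l * c(0, coefs)
  -Re(coefs)[-1]                           # AR coefficients rho_1, ..., rho_p
}
cond_loglik <- function(x, mu, rho, sig2) {
  p <- length(rho)
  lagged <- embed(x - mu, p + 1)
  resid <- lagged[, 1] - lagged[, -1, drop = FALSE] %*% rho
  -(length(x) - p) * log(sig2) / 2 - sum(resid^2) / (2 * sig2)
}

set.seed(4)                                # arbitrary seed
x <- as.numeric(arima.sim(list(ar = 0.7), n = 300))   # simulated AR(1) data
lambda <- c(0.2, -0.1)         # current real inverse roots, hence p = 2
mu <- 0; sig2 <- 1             # held fixed in this sketch
llo <- cond_loglik(x, mu, roots_to_rho(lambda), sig2)

niter <- 2000
chain <- matrix(NA, niter, length(lambda))
for (t in 1:niter) {
  j <- sample(length(lambda), 1)           # select one root at random
  prop <- lambda
  prop[j] <- 2 * runif(1) - 1              # new real root from the uniform prior
  lloprop <- cond_loglik(x, mu, roots_to_rho(prop), sig2)
  if (log(runif(1)) < lloprop - llo) {     # likelihood ratio: the prior cancels
    lambda <- prop
    llo <- lloprop
  }
  chain[t, ] <- lambda
}
colMeans(chain)
```

Since the proposal is the prior itself, the acceptance probability only involves the two log-likelihoods, as in the ratio displayed in the algorithm; the moves between real and complex roots of step 2 would additionally require the reweighting discussed next.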
While the whole R code is too long (300 lines) to be reproduced here, the core part about the modification of the roots can be implemented as follows: "down" moves removing one pair of complex roots are chosen with probability 0.1 while "up" moves creating one pair of complex roots are chosen with probability 0.9, in order to compensate for the inherently higher difficulty in accepting complex proposals from the prior. Those uneven weights must then be accounted for in the acceptance probability, along with the changes in the masses of the uniform priors on the real and complex roots in (7.8).

if (runif(1)<.1){ #down
  ppropcomp=pcomp-2; ppropreal=preal+2
  ind=sample(1:pcomp,1) #indices of removed complex root
  ind=ind-(ind%%2==0)
  if (ppropcomp>0){
    lambpropcomp=lambdacomp[(1:pcomp)[-(ind:(ind+1))]]
  }else{ #no complex root
    lambpropcomp=0 #dummy necessary for ARllog function
  }
  lambpropreal=c(lambdareal,2*runif(2)-1)
  coef=9*(1+(preal<2))*(pi/4) #if new case is boundary
}else{ #up
  ppropreal=preal-2; ppropcomp=pcomp+2
  ind=sample(1:preal,2) #indices of removed real roots
  if (ppropreal>0){
    lambpropreal=lambdareal[(1:preal)[-ind]]
  }else{
    lambpropreal=0 #dummy necessary for ARllog function
  }
  theta=2*pi*runif(1); rho=sqrt(runif(1))
  lambpropcomp=c(lambdacomp,rho*cos(theta),rho*sin(theta))
  coef=(4/pi)*(1+(ppropcomp<p-1))/9 #if new case is boundary
}

with the boundary cases with no complex root or fewer than two real roots requiring special processing (not reproduced here). The Metropolis-Hastings acceptance step is then simple:

lloprop=ARllog(pr=ppropreal,pc=ppropcomp,
  lr=lambpropreal,lc=lambpropcomp,mu,sig2)
if (log(runif(1))<log(coef)+lloprop-llo){
  llo=lloprop
  preal=ppropreal; pcomp=ppropcomp
  lambdacomp=lambpropcomp; lambdareal=lambpropreal
}


illustrating  the  role  of  the  coef  correction. 


As an application of the above, we processed the Ahold Kon. series of Eurostoxx50. We ran the algorithm for the whole series with p = 5, with satisfactory jump behavior between the different numbers of complex roots. The same behavior can be observed with larger values of p. Note that a call to the non-Bayesian R ar() procedure gives an order of 1 for this series, as

> ar(x = Eurostoxx50[, 4])

Coefficients:
     1
0.9968

Order selected 1  sigma^2 estimated as  0.5399

This standard analysis is very unstable: for instance, using the following alternative produces a very different order estimate!

> ar(x = Eurostoxx50[, 4], method = "ml")

Coefficients:
     1       2       3       4       5       6       7       8
 1.042  -0.080  -0.038   0.080  -0.049   0.006   0.080  -0.043

Order selected 8  sigma^2 estimated as  0.3228


Figure 7.4 summarizes the MCMC output for 50,000 iterations. The top left graph shows that jumps between 2 and 0 complex roots occur with high frequency and therefore that the MCMC algorithm mixes well between both (sub)models. Both following graphs on the first row relate to the hyperparameters $\mu$ and $\sigma$, which are updated outside the reversible jump steps. The parameter $\mu$ appears to be mixing better than $\sigma$, which is certainly due to the choice of the same scaling factor in both cases. The middle rows correspond to the first three coefficients of the autoregressive model, $\varrho_1, \varrho_2, \varrho_3$. Their stability is a good indicator of the convergence of the reversible jump algorithm. Note also that, except for $\varrho_1$, the other coefficients are close to 0 (since their posterior means are approximately 0.052, −0.0001, $2.99 \times 10^{-5}$, and $-2.66 \times 10^{-7}$, respectively). The final row is an assessment of the fit of the model and the convergence of the MCMC algorithm. The first graph provides the sequence of corresponding log-likelihoods, which remain stable almost from the start, the second the distribution of the complex (inverse) roots, and the last one the connection between the actual series and its one-step-ahead prediction

$$\mathbb{E}[x_{t+1} \mid x_t, x_{t-1}, \ldots] .$$

On this scale, both series are closely related.
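For given values of the parameters, the one-step-ahead prediction follows directly from (7.4). The sketch below computes this plug-in version; the function name one_step_ahead is an assumption, and averaging the resulting predictions over the MCMC draws of $(\mu, \varrho)$ would be one way to approximate the posterior expectation above.

```r
# Hypothetical helper: plug-in one-step-ahead predictions
# mu + sum_i rho_i (x_{t+1-i} - mu) of x_{t+1}, for t = p, ..., T - 1,
# where x is an R vector of length T.
one_step_ahead <- function(x, mu, rho) {
  p <- length(rho)
  lagged <- embed(x - mu, p)           # rows hold (x_t, ..., x_{t-p+1}) - mu
  pred <- mu + lagged %*% rho
  as.numeric(pred[-nrow(pred)])        # drop the forecast beyond the sample
}

x <- c(1, 2, 3, 4)
pred <- one_step_ahead(x, mu = 0, rho = c(0.5, 0.25))
cbind(observed = x[3:4], predicted = pred)
```

The returned vector is aligned with the observations from index p + 1 onward, so it can be overlaid on the series as in the bottom right panel of the figure.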


While the above algorithm is a regular Metropolis-Hastings algorithm on a parameter space with a fixed number of parameters, $\varrho_1, \ldots, \varrho_p$, the real-complex dichotomy gives us the opportunity to mention a new class of MCMC algorithms, variable dimension MCMC algorithms. The class of variable dimension models is made of models characterized by a collection of submodels, $\mathfrak{M}_k$, often nested, that are considered simultaneously and associated with different parameter spaces. The number of submodels can be infinite, and the "parameter" is defined conditionally on the index of the submodel, $\theta = (k, \theta_k)$, with a dimension that generally depends on k. It naturally occurs in settings like Bayesian model choice and Bayesian model assessment.

Fig. 7.4. Dataset Eurostoxx50: Output of the MCMC algorithm for the Ahold Kon. series and an AR(5) model: (top row, left) histogram and sequence of numbers of complex roots (ranging from 0 to 4), (top row, middle and right) sequence of $\mu$ and $\sigma^2$, (middle row) sequences of $\varrho_i$ ($i = 1, 2, 3$), (bottom row, left) sequence of observed log-likelihood, (bottom row, middle) representation of the cloud of complex roots, with a part of the boundary of the unit circle on the right, (bottom row, right) comparison of the series and the one-step-ahead prediction

Inference on such structures is obviously more complicated than on single models, especially when there are an infinite number of submodels, and it can be tackled from two different (or even opposite) perspectives. The first approach is to consider the variable dimension model as a whole and to estimate quantities that are meaningful for the whole model (such as moments or predictives) as well as quantities that only make sense for submodels (such as posterior probabilities of submodels and posterior moments of $\theta_k$). From a Bayesian perspective, once a prior is defined on $\theta$, the only difficulty is in finding an efficient way to explore the complex parameter space in order to produce these estimators. The second perspective on variable dimension models is to resort to testing, rather than estimation, by adopting a model choice stance. This requires choosing among all possible submodels the "best one" in terms of an appropriate criterion. The drawbacks of this second approach are far from benign. The computational burden may be overwhelming when the number of models is infinite, the interpretation of the selected model is delicate and the variability of the resulting inference is underestimated since it is impossible to include the effect of the selection of the model in the assessment of the variability of the estimators built in later stages. Nonetheless, this is an approach often used in linear and generalized linear models (Chaps. 3 and 4) where subgroups of covariates are compared against a given dataset. It is obviously the recommended approach when the number of models is small, as in the mixture case (Chap. 6) or in the selection of the order p of an AR(p) model, provided the Bayes factors can be approximated.

MCMC algorithms that can handle such variable-dimension structures face measure-theoretic difficulties and, while a universal and elegant solution through reversible jump algorithms exists (Green, 1995), we have made the choice of not covering these in this book. An introductory coverage can be found in the earlier edition (Marin and Robert, 2007, Sect. 6.7), as well as in Robert and Casella (2004). Nonetheless, we want to point out that the above MCMC algorithm happens to be a special case of the birth-and-death MCMC algorithm (and of its generalization, the reversible jump algorithm) where, in nested models, additional components are generated from the prior distribution and the move to a larger model is accepted with a probability equal to the ratio of the likelihoods (with the proper reweighting to account for the multiplicity of possible moves). For instance, extending the above algorithm to the case of the unknown order p is straightforward.


7.3  Moving  Average  (MA)  Models 

A second type of time series model that still enjoys linear dependence and closed-form expression is the MA(q) model, where MA stands for moving average. It appears as a dual version of the AR(p) model.

An MA(1) process $(x_t)_{t \in \mathbb{Z}}$ is such that, conditionally on the past ($t \in \mathcal{T}$),

$$x_t = \mu + \epsilon_t - \vartheta \epsilon_{t-1} , \qquad (7.9)$$

where $(\epsilon_t)_{t \in \mathcal{T}}$ is a white noise sequence. For the same reasons as above, we will assume the white noise is normally distributed unless otherwise specified. Thus,

$$\mathbb{E}[x_t] = \mu , \quad \mathbb{V}(x_t) = (1 + \vartheta^2)\sigma^2 , \quad \gamma_x(1) = -\vartheta\sigma^2 , \quad \text{and} \quad \gamma_x(h) = 0 \ \ (h > 1) .$$

An important feature of (7.9) is that the model is not identifiable per se. Indeed, we can also rewrite $x_t$ as

$$x_t = \mu + \tilde\epsilon_{t-1} - \frac{1}{\vartheta} \tilde\epsilon_t , \qquad \tilde\epsilon \sim \mathcal{N}(0, \vartheta^2 \sigma^2) .$$

Therefore, both pairs $(\vartheta, \sigma)$ and $(1/\vartheta, \vartheta\sigma)$ are equivalent representations of the same model. To achieve identifiability, it is therefore customary (in non-Bayesian environments) to restrict the parameter space of MA(1) processes by

$$|\vartheta| < 1 ,$$

and we will follow suit. Such processes are called invertible. As with causality, the property of invertibility is not a property of the sole process $(x_t)_{t \in \mathbb{Z}}$ but of the connection between the two processes $(x_t)_{t \in \mathcal{T}}$ and $(\epsilon_t)_{t \in \mathcal{T}}$.
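The lack of identifiability can be checked on the autocovariances, which fully determine the distribution of a Gaussian series once the mean is fixed. The sketch below, in which the function name ma_autocov and the numerical values are assumptions, computes the autocovariances of a moving average written with the sign convention of (7.9) and compares the pairs $(\vartheta, \sigma)$ and $(1/\vartheta, \vartheta\sigma)$.

```r
# Hypothetical helper: autocovariances gamma_x(0), ..., gamma_x(lag.max) of
# x_t = mu + eps_t - sum_j theta_j eps_{t-j} with Var(eps_t) = sig2
ma_autocov <- function(theta, sig2, lag.max) {
  psi <- c(1, -theta)                      # weights on eps_t, eps_{t-1}, ...
  q <- length(theta)
  sapply(0:lag.max, function(h) {
    if (h > q) return(0)                   # no common noise term beyond lag q
    sig2 * sum(psi[1:(q + 1 - h)] * psi[(1 + h):(q + 1)])
  })
}

theta <- 0.4; sigma <- 1.5
g1 <- ma_autocov(theta, sigma^2, 3)                  # pair (theta, sigma)
g2 <- ma_autocov(1 / theta, (theta * sigma)^2, 3)    # pair (1/theta, theta*sigma)
rbind(g1, g2)
```

Both rows coincide, so the data cannot distinguish between the two parameterizations; restricting to $|\vartheta| < 1$ keeps only the first one.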

A natural extension of the MA(1) model is to increase the dependence on the past innovations, namely to introduce the MA(q) process as the process $(x_t)_{t \in \mathcal{T}}$ defined by

$$x_t = \mu + \epsilon_t - \sum_{i=1}^{q} \vartheta_i \epsilon_{t-i} , \qquad (7.10)$$

where $(\epsilon_t)_{t \in \mathcal{T}}$ is a white noise (once again assumed to be normal unless otherwise specified). The corresponding identifiability condition in this model is that the roots of the polynomial

$$\mathcal{Q}(u) = 1 - \sum_{i=1}^{q} \vartheta_i u^i$$

are all outside the unit circle in the complex plane (see Brockwell and Davis, 1996, Theorem 3.1.2, for a proof). Thus, we end up with exactly the same parameter space as in the AR(q) case!

The intuition behind the MA(q) representation is however less straightforward than the regression structure underlying the AR(p) model. This representation assumes that the dependence between observables stems from a dependence between the (unobserved) noises rather than directly through the observables. Furthermore, in contrast with the AR(p) models, where the covariance between the terms of the series is exponentially decreasing to zero but always different from 0, the autocovariance function for the MA(q) model is such that $\gamma_x(s)$ is equal to 0 for $|s| > q$, meaning that $x_{t+s}$ and $x_t$ are independent. In addition, the MA(q) process is obviously (second-order and strictly) stationary, whatever the vector $(\vartheta_1, \ldots, \vartheta_q)$, since the white noise is iid and the distribution of (7.10) is thus independent of t. A major difference between the MA(q) and the AR(p) models, though, is that the MA(q) dependence structure is not Markov (even though it can be represented as a Markov process through a state-space representation, introduced below).

While, in the Gaussian case, the whole (observed) vector $x_{1:T}$ is a realization of a normal random variable, with constant mean $\mu$ and covariance matrix $A$, and thus provides a formally explicit likelihood function, both the computation and the integration (or maximization) of this likelihood are quite costly since they involve inverting the huge matrix $A$.¹⁰

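A more manageable representation of the MA(q) likelihood is to use the likelihood of $x_{1:T}$ conditional on the past white noises $\epsilon_0, \ldots, \epsilon_{-q+1}$,

$$\ell^c(\mu, \vartheta_1, \ldots, \vartheta_q, \sigma \mid x_{1:T}, \epsilon_{(-q+1):0}) \propto \sigma^{-T} \prod_{t=1}^{T} \exp\left\{ -\Big( x_t - \mu + \sum_{j=1}^{q} \vartheta_j \hat\epsilon_{t-j} \Big)^2 \Big/ 2\sigma^2 \right\} , \qquad (7.11)$$

where $\hat\epsilon_0 = \epsilon_0, \ldots, \hat\epsilon_{1-q} = \epsilon_{1-q}$ and ($t > 0$)

$$\hat\epsilon_t = x_t - \mu + \sum_{j=1}^{q} \vartheta_j \hat\epsilon_{t-j} .$$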

This recursive definition of the likelihood is still costly since it involves T sums of q terms. Nonetheless, even though the problem of handling the conditioning values $\epsilon_{(-q+1):0}$ must be treated separately via an MCMC step, the complexity O(Tq) of this representation is much more manageable than the normal exact representation mentioned above.
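The recursion defining the $\hat\epsilon_t$'s translates into a short loop. The following sketch works directly with the coefficients $\vartheta_j$ and the sign convention of (7.10) and (7.11); the function name ma_cond_loglik, its arguments, and the simulated data are assumptions of this illustration.

```r
# Hypothetical helper: recursive residuals and conditional log-likelihood
# (7.11), up to an additive constant, given the q past noises
# eps = (eps_{1-q}, ..., eps_0) and theta = (theta_1, ..., theta_q).
ma_cond_loglik <- function(x, mu, theta, sig2, eps) {
  q <- length(theta)
  n <- length(x)
  heps <- c(eps, rep(0, n))                # heps[q + t] stores epsilon-hat_t
  for (t in 1:n) {
    past <- heps[(q + t - 1):t]            # epsilon-hat_{t-1}, ..., epsilon-hat_{t-q}
    heps[q + t] <- x[t] - mu + sum(theta * past)
  }
  resid <- heps[(q + 1):(q + n)]
  list(loglik = -n * log(sig2) / 2 - sum(resid^2) / (2 * sig2), resid = resid)
}

# Simulated check: with the true past noises, the recursion recovers the noises
set.seed(5)                                # arbitrary seed
q <- 2; n <- 50; theta <- c(0.5, -0.3); mu <- 1
e <- rnorm(n + q)                          # e[q + t] plays the role of eps_t
x <- sapply(1:n, function(t) mu + e[q + t] - sum(theta * e[(q + t - 1):t]))
out <- ma_cond_loglik(x, mu, theta, sig2 = 1, eps = e[1:q])
max(abs(out$resid - e[(q + 1):(q + n)]))
```

Each of the n iterations involves a sum of q terms, which is the O(Tq) cost mentioned above.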

Since the transform of the roots into the coefficients is exactly the same as with the AR(q) model, the expression of the log-likelihood function conditional on the past white noises eps is quite straightforward. Taking for Psi the subvector Psi[2:(p+1),p], the computation goes as follows:

x=x-mu
# construction of the epsilonhats
heps=rep(0,T+q)
heps[1:q]=eps # past noises
for (i in 1:T)
  heps[q+i]=x[i]+sum(rev(Psi)*heps[i:(q+i-1)]) # index q+i (printed as p+i)
# completed loglikelihood (includes negative epsilons)
loglike=-((sum(heps^2)/sig2)+(T+q)*log(sig2))/2
x=x+mu


¹⁰ Obviously, taking advantage of the block diagonal structure of $A$, due to the fact that $\gamma_x(s) = 0$ for $|s| > q$, may reduce the computational cost, but this requires advanced programming abilities!

Given both $x_{1:T}$ and the past noises $\epsilon_{(-q+1):0}$, the conditional posterior distribution of the parameters $(\mu, \vartheta_1, \ldots, \vartheta_q, \sigma)$ is formally very close to the posterior associated with an AR(q) model. This proximity is such that we can recycle the code of Algorithm 7.13 to some extent since the simulation of the (inverse) roots of the polynomial $\mathcal{Q}$ is identical once we modify the likelihood according to the above changes. The past noises $\epsilon_{-i}$ ($i = 1, \ldots, q$) are simulated conditional both on the $x_t$'s and on the parameters $\mu$, $\sigma$ and $\vartheta = (\vartheta_1, \ldots, \vartheta_q)$. While the exact distribution

$$f(\epsilon_{(-q+1):0} \mid x_{1:T}, \mu, \sigma, \vartheta) \propto \prod_{i=-q+1}^{0} e^{-\epsilon_i^2 / 2\sigma^2} \prod_{t=1}^{T} e^{-\hat\epsilon_t^2 / 2\sigma^2} , \qquad (7.12)$$

where the $\hat\epsilon_t$'s are defined as above, is exactly a normal distribution on the vector $\epsilon_{(-q+1):0}$ (Exercise 7.13), its computation is too costly to be available for realistic values of T. We therefore implement a hybrid Gibbs algorithm where the missing noise $\epsilon_{(-q+1):0}$ is simulated from a proposal based either on the previous simulated value of $\epsilon_{(-q+1):0}$ (in which case we use a simple termwise random walk) or on the first part of (7.12) (in which case we can use normal proposals).¹¹ More specifically, one can express $\hat\epsilon_t$ ($1 \le t \le q$) in terms of the $\epsilon_{-t}$'s and derive the corresponding (conditional) normal distribution on either each $\epsilon_{-t}$ or on the whole vector $\epsilon$¹² (see Exercise 7.14).

The additional step, when compared with the AR(p) function, is the conditional simulation of the past noises $\epsilon_{(-q+1):0}$. For $1 \le i \le q$ (the indices are shifted to start at 1 rather than $-q$), the corresponding part of our R code is as follows. Unfortunately, the derivation of the Metropolis-Hastings acceptance probability does require computing the inverse $\hat\epsilon_t$'s as they are functions of the proposed noises.


x=x-mu
heps[1:q]=eps # simulated ones
for (j in (q+1):(2*q+1)) # epsilon hat
  heps[j]=x[j]+sum(rev(Psi)*heps[(j-q):(j-1)])
heps[i]=0
for (j in 1:(q-i+1))
  keps[j]=x[j]+sum(rev(Psi)*heps[j:(j+q-1)])
x=x+mu

epsvar=1/sum(c(1,Psi[i:q]^2))
# as printed, the next two lines leave epsmean unscaled by epsvar,
# unlike the reverse-move computation further below
epsmean=sum(Psi[i:q]*keps[1:(q-i+1)])*epsvar
epsmean=epsmean/epsvar
epsvar=sig2*epsvar
propeps=rnorm(1,mean=epsmean,sd=sqrt(epsvar))
epspr=eps
epspr[i]=propeps
lloprop=MAllog(pr=preal,pc=pcomp,lr=lambdareal,
  lc=lambdacomp,mu=mu,sig2=sig2,compsi=FALSE,pepsi=Psi,
  eps=epspr)
propsal1=dnorm(propeps,mean=epsmean,sd=sqrt(epsvar),log=TRUE)

x=x-mu
heps[i]=propeps
for (j in (q+1):(2*q+1))
  heps[j]=x[j]+sum(rev(Psi)*heps[(j-q):(j-1)])
heps[i]=0
for (j in 1:(q-i+1))
  keps[j]=x[j]+sum(rev(Psi)*heps[j:(j+q-1)])
x=x+mu

epsvar=1/sum(c(1,Psi[i:q]^2))
epsmean=sum(Psi[i:q]*keps[1:(q-i+1)])
epsmean=epsmean*epsvar
epsvar=sig2*epsvar
propsal0=dnorm(eps[i],mean=epsmean,sd=sqrt(epsvar),log=TRUE)

if (log(runif(1))<lloprop-llo-propsal1+propsal0){
  eps[i]=propeps;
  llo=lloprop
}

¹¹ In the following output analysis, we actually used a more hybrid proposal with the innovations $\hat\epsilon_t$'s ($1 \le t \le q$) fixed at their previous values. This approximation remains valid when accounted for in the Metropolis-Hastings acceptance ratio, which requires computing the $\hat\epsilon_t$'s associated with the proposed $\epsilon_{-i}$.

¹² Using the horizon $t = q$ is perfectly sensible in this setting given that $x_1, \ldots, x_q$ are the only observations correlated with the $\epsilon_{-t}$'s, even though (7.11) gives the impression of the opposite, since all $\hat\epsilon_t$'s depend on the $\epsilon_{-t}$'s.

The complete R code also includes an additional random walk perturbation of the $\epsilon_i$'s, centered at the current value, with proposal

propeps=rnorm(1,mean=eps[i],sd=0.1*sqrt(sig2))

in order to improve the mixing properties of the chain. Apart from those changes, the R code is identical to the code used for the AR(p) model.

Algorithm 7.14 MCMC MA(q) Sampler

Initialization: Choose $\lambda^{(0)}$, $\epsilon^{(0)}$, $\mu^{(0)}$, and $\sigma^{(0)}$ arbitrarily.

Iteration t (t ≥ 1):

1. Run steps 1-4 of Algorithm 7.13 conditional on $\epsilon^{(t-1)}$ with the correct corresponding conditional likelihood.

2. Simulate $\epsilon^{(t)}$ by a Metropolis-Hastings step.


To illustrate the behavior of this algorithm, we considered the first 350 points of the Air Liquide series in Eurostoxx50. The output is represented in Fig. 7.5 for q = 9 and 10,000 iterations of Algorithm 7.14, with the same conventions as in Fig. 7.4, except that the lower right graph represents the series of the simulated $\epsilon_t$'s rather than the predictive behavior.

Interestingly, the likelihood found by the algorithm as the iteration proceeds is (numerically) much higher than the one found by the classical R arima procedure since it differs by a factor of 450 on the log scale (assuming we are talking of the same quantity since R arima computes the log-likelihood associated with the observations without the $\epsilon_{-i}$'s!). The details of the call to arima are as follows:

> arima(x = Eurostoxx50[1:350, 5], order = c(0, 0, 9))

Coefficients:
         ma1     ma2     ma3     ma4     ma5     ma6     ma7
      1.0605  0.9949  0.9652  0.8542  0.8148  0.7486  0.5574
s.e.  0.0531  0.0760  0.0881  0.0930  0.0886  0.0827  0.0774
         ma8     ma9  intercept
      0.3386  0.1300   114.3146
s.e.  0.0664  0.0516     1.1281

sigma^2 estimated as 8.15:  log likelihood = -864.97

The favored number of complex roots is 6, and the smaller values 0 and 2 are
not visited after the initial warmup. The mixing over the $\sigma$ parameter is again
lower than over the mean $\mu$, despite the use of three different proposals. The
first  one  is  based  on  the  inverted  gamma  distribution  associated  with 
the second one is based on a (log) random walk with scale $0.1\,\hat\sigma_x$, and the third one is an independent inverted gamma distribution with scale $\hat\sigma_x/(1 + \vartheta_1^2 + \ldots + \vartheta_q^2)^{1/2}$. Note also that, except for $\vartheta_9$, the other coefficients $\vartheta_i$ are quite different from 0 (since their posterior means are approximately 1.0206, 0.8403, 0.8149, 0.6869, 0.6969, 0.5693, 0.2889, and 0.0895, respectively). This is also the case for the estimates above obtained in R arima. The prediction being of little interest for MA models (Exercise 7.15), we represent instead the range of simulated $\epsilon_t$'s in the bottom right figure. The range is compatible with the $\mathcal{N}(0, \sigma^2)$ distribution.

Fig. 7.5. Dataset Eurostoxx50: Output of the MCMC algorithm for the Air Liquide series and an MA(9) model: (top row, left) histogram and sequence of numbers of complex roots (ranging from 0 to 8); (top row, middle and right) sequence of $\mu$ and $\sigma^2$; (middle row) sequences of $\vartheta_i$ (i = 1, 2, 3); (bottom row, left) sequence of observed likelihood; (bottom row, middle) representation of the cloud of complex roots, with the boundary of the unit circle; and (bottom row, right) evolution of the simulated $\epsilon_t$'s


7.4  ARMA  Models  and  Other  Extensions 

An alternative approach that is of considerable interest for the representation and analysis of the MA(q) model and its generalizations is the so-called state-space representation, which relies on missing variables to recover both the Markov structure and the linear framework.13

The general idea is to represent a time series $(x_t)$ as a system of two equations,

$$x_t = G y_t + \boldsymbol{\epsilon}_t , \qquad (7.13)$$

$$y_{t+1} = F y_t + \boldsymbol{\xi}_t , \qquad (7.14)$$

where $\boldsymbol{\epsilon}_t$ and $\boldsymbol{\xi}_t$ are multivariate normal vectors14 with general covariance matrices that may depend on t and $E[\boldsymbol{\epsilon}_u^T \boldsymbol{\xi}_v] = 0$ for all (u, v)'s. Equation (7.13) is called the observation equation, while (7.14) is called the state equation. This representation embeds the process of interest $(x_t)$ into a larger space, the state space, where the missing process $(y_t)$ is Markov and linear. For instance, (7.6) is a state-space representation of the AR(p) model (see Exercise 7.16).

13 It is also inspired from the Kalman filter, ubiquitous for prediction, smoothing, and filtering in time series.

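The MA(q) model can be written that way by defining $y_t$ as

$$y_t = (\epsilon_{t-q}, \ldots, \epsilon_{t-1}, \epsilon_t)^T .$$

Then the state equation is

$$y_{t+1} = \begin{pmatrix} 0 & 1 & 0 & \ldots & 0 \\ 0 & 0 & 1 & \ldots & 0 \\ & & \ldots & & \\ 0 & 0 & 0 & \ldots & 1 \\ 0 & 0 & 0 & \ldots & 0 \end{pmatrix} y_t + \epsilon_{t+1} \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 1 \end{pmatrix} ,$$

while the observation equation is

$$x_t = \mu - (\vartheta_q \ \ \vartheta_{q-1} \ \ldots \ \vartheta_1 \ \ {-1})\, y_t , \qquad (7.15)$$

with no perturbation $\boldsymbol{\epsilon}_t$.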
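As a quick numerical check of this representation, the following R sketch simulates an MA(q) series through the state and observation equations and compares it with the direct definition $x_t = \mu + \epsilon_t - \sum_j \vartheta_j \epsilon_{t-j}$. The parameter values and the names Fmat, e_last, and obs are illustrative assumptions, not part of the book's code.

```r
# Sketch: state-space form of an MA(q) model, checked against the direct formula.
set.seed(1)
q <- 3; mu <- 2; theta <- c(0.5, -0.3, 0.2); n <- 50   # illustrative values
eps <- rnorm(n + q)                 # eps[k] plays the role of epsilon_{k-q}

# Transition matrix: shifts the state up by one slot, last row is zero
Fmat <- matrix(0, q + 1, q + 1)
Fmat[cbind(1:q, 2:(q + 1))] <- 1
e_last <- c(rep(0, q), 1)           # the new innovation enters the last slot
obs <- c(rev(theta), -1)            # (theta_q, ..., theta_1, -1)

y <- eps[1:(q + 1)]                 # y_1 = (epsilon_{1-q}, ..., epsilon_1)
x_ss <- numeric(n)
for (t in 1:n) {
  x_ss[t] <- mu - sum(obs * y)      # observation equation, no extra noise
  if (t < n) y <- as.vector(Fmat %*% y) + e_last * eps[t + q + 1]  # state equation
}

# Direct definition of the MA(q) series with the same innovations
x_direct <- sapply(1:n, function(t)
  mu + eps[t + q] - sum(theta * eps[t + q - (1:q)]))
max_gap <- max(abs(x_ss - x_direct))
```

Under these assumptions max_gap should be zero up to rounding error, which illustrates that the state vector simply stores the last q + 1 innovations.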

The state-space decomposition of the MA(q) model thus involves no vector $\boldsymbol{\epsilon}_t$ in the observation equation, while $\boldsymbol{\xi}_t$ is degenerate in the state equation. The degeneracy phenomenon is quite common in state-space representations, but this is not a hindrance in conditional uses of the model, as in MCMC implementations. Notice also that the state-space representation of a model is not unique, again a harmless feature for MCMC uses. For instance, for the MA(1) model, the observation equation can also be chosen as $x_t = \mu + (1 \ \ 0)\, y_t$ with $y_t = (y_{1t}, y_{2t})^T$ directed by the state equation

$$y_{t+1} = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} y_t + \begin{pmatrix} 1 \\ -\vartheta_1 \end{pmatrix} \epsilon_{t+1} .$$


Note that, while the state-space representation is wide-ranging and convenient, it does not mean that the derived MCMC strategies are necessarily efficient. In particular, when the hidden state $y_t$ is too large, a naive completion may prove disastrous. Alternative solutions based on sequential importance sampling (SMC) have been shown to be usually more efficient. (See Del Moral et al., 2006.)

14 Notice the different fonts that distinguish the $\boldsymbol{\epsilon}_t$'s used in the state-space representation from the $\epsilon_t$'s used in the AR and MA models.

A straightforward extension of both previous AR and MA models is given by the (normal) ARMA(p, q) models, where $x_t$ ($t \in \mathbb{Z}$) is conditionally defined by

$$x_t = \mu + \sum_{i=1}^{p} \varrho_i (x_{t-i} - \mu) + \epsilon_t - \sum_{j=1}^{q} \vartheta_j \epsilon_{t-j} , \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2) , \qquad (7.16)$$

the $(\epsilon_t)$'s being independent. The role of such models, as compared with both AR and MA models, is to aim toward parsimony; that is, to resort to much smaller values of p and q than in a pure AR(p) or a pure MA(q) modeling.

The causality and invertibility conditions on the parameters of (7.16) still correspond to the roots of both polynomials $\mathcal{P}$ and $\mathcal{Q}$ being outside the unit circle, respectively, with a further condition that both polynomials have no common root. (But a common root almost surely never occurs under a continuous prior on the parameters.) The root reparameterization can therefore be implemented for both the $\vartheta_i$'s and the $\varrho_i$'s, still calling for MCMC techniques owing to the complexity of the posterior distribution.

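State-space representations also exist for ARMA(p, q) models, one possibility being

$$x_t = \mu - (\vartheta_{r-1} \ \ \vartheta_{r-2} \ \ldots \ \vartheta_1 \ \ {-1})\, y_t$$

for the observation equation and

$$y_{t+1} = \begin{pmatrix} 0 & 1 & 0 & \ldots & 0 \\ 0 & 0 & 1 & \ldots & 0 \\ & & \ldots & & \\ 0 & 0 & 0 & \ldots & 1 \\ \varrho_r & \varrho_{r-1} & \varrho_{r-2} & \ldots & \varrho_1 \end{pmatrix} y_t + \epsilon_{t+1} \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 1 \end{pmatrix} \qquad (7.17)$$

for the state equation, with $r = \max(p, q + 1)$ and the convention that $\varrho_m = 0$ if $m > p$ and $\vartheta_m = 0$ if $m > q$.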
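The representation (7.17) can be checked numerically in the same spirit. The R sketch below builds the companion matrix for an ARMA(2,1) model, simulates the series through the state-space equations started from a zero state, and compares it with the direct recursion (7.16) run with zero pre-sample values. All parameter values and variable names are illustrative assumptions, and the sketch assumes r ≥ 2.

```r
# Sketch: ARMA(p, q) state-space form (7.17) versus the direct recursion (7.16).
set.seed(2)
p <- 2; q <- 1; mu <- 1
rho <- c(0.5, -0.2); theta <- 0.4; n <- 60        # illustrative values
r <- max(p, q + 1)
rho_r   <- c(rho, rep(0, r - p))                  # rho_m = 0 for m > p
theta_r <- c(theta, rep(0, r - 1 - q))            # theta_1, ..., theta_{r-1}

Fmat <- matrix(0, r, r)
Fmat[cbind(1:(r - 1), 2:r)] <- 1                  # shift part of the matrix
Fmat[r, ] <- rev(rho_r)                           # last row (rho_r, ..., rho_1)
obs <- c(rev(theta_r), -1)                        # (theta_{r-1}, ..., theta_1, -1)

eps <- rnorm(n)
y <- rep(0, r)                                    # zero initial state
x_ss <- numeric(n)
for (t in 1:n) {
  y <- as.vector(Fmat %*% y) + c(rep(0, r - 1), eps[t])   # state equation
  x_ss[t] <- mu - sum(obs * y)                            # observation equation
}

# Direct ARMA recursion on the centred series, zero pre-sample values
xc <- numeric(n + p)
ep <- c(rep(0, q), eps)
for (t in 1:n)
  xc[t + p] <- sum(rho * xc[t + p - (1:p)]) + ep[t + q] -
    sum(theta * ep[t + q - (1:q)])
x_direct <- mu + xc[-(1:p)]
max_gap <- max(abs(x_ss - x_direct))
```

Both constructions use the same innovations, so under these starting conventions the two series should coincide up to rounding error.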

Similarly to the MA(q) case, this state-space representation is handy in devising MCMC algorithms that converge to the posterior distribution of the parameters of the ARMA(p, q) model.

A straightforward MCMC processing of the ARMA model is to take advantage of the AR and MA algorithms that have been constructed above by using both algorithms sequentially. Indeed, conditionally on the AR parameters, the ARMA model can be expressed as an MA model and, conversely, conditionally on the MA parameters, the ARMA model can be expressed almost as an AR model. This is quite obvious for the MA part since, if we define ($t > p$)

$$\tilde x_t = x_t - \mu - \sum_{i=1}^{p} \varrho_i (x_{t-i} - \mu) ,$$

the likelihood is formally equal to a standard MA(q) likelihood on the $\tilde x_t$'s. The reconstitution of the AR(p) likelihood is more involved: If we now define the residuals $\tilde\epsilon_t = \sum_{j=1}^{q} \vartheta_j \epsilon_{t-j}$, the log-likelihood conditional on $x_{0:(p-1)}$ is

$$- \sum_{t=p}^{T} \Big( x_t - \mu - \sum_{j=1}^{p} \varrho_j [x_{t-j} - \mu] + \tilde\epsilon_t \Big)^2 \Big/ 2\sigma^2 ,$$

which is obviously close to an AR(p) log-likelihood, except for the $\tilde\epsilon_t$'s. The original AR(p) MCMC code can then be recycled modulo this modification in the likelihood.

Another extension of the AR model is the ARCH model, used to represent processes, particularly in finance, with independent errors but time-dependent variances, as in the ARCH(p) process15 ($t \in \mathbb{Z}$)

$$x_t = \sigma_t \epsilon_t , \qquad \epsilon_t \stackrel{\text{iid}}{\sim} \mathcal{N}(0, 1) , \qquad \sigma_t^2 = \alpha + \sum_{i=1}^{p} \beta_i x_{t-i}^2 .$$

The ARCH(p) process defines a Markov chain (of order p) since $x_t$ only depends on $x_{(t-p):(t-1)}$. It can be shown that a stationarity condition for the ARCH(1) model is that $E[\log(\beta_1 \epsilon_t^2)] < 0$, which is equivalent to $\beta_1 < 2e^{\gamma} \approx 3.56$, where $\gamma$ denotes Euler's constant. This condition becomes much more involved for larger values of p. Contrary to the stochastic volatility model defined below, the ARCH(p) model enjoys a closed-form likelihood when conditioning on the initial values $x_1, \ldots, x_p$. However, because of the nonlinearities in the variance terms, approximate methods based on MCMC algorithms must be used for their analysis.
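To make the ARCH statements concrete, here is a small R sketch. It evaluates the ARCH(1) stationarity bound from the identity $E[\log \epsilon_t^2] = \log 2 + \psi(1/2)$ for a standard normal $\epsilon_t$ ($\psi$ being the digamma function), and it codes the closed-form log-likelihood conditional on the first p values. The function name arch_loglik and the parameter values are assumptions made for this illustration.

```r
# Sketch: ARCH(1) stationarity bound and ARCH(p) conditional log-likelihood.

# E[log eps^2] for eps ~ N(0,1), i.e. the mean of a log chi-square(1) variable
e_log <- log(2) + digamma(0.5)
beta_max <- exp(-e_log)      # E[log(beta_1 eps^2)] < 0  <=>  beta_1 < beta_max

# Log-likelihood of x_{p+1}, ..., x_n given the first p values
arch_loglik <- function(x, alpha, beta) {
  p <- length(beta); n <- length(x)
  s2 <- sapply((p + 1):n, function(t) alpha + sum(beta * x[t - (1:p)]^2))
  sum(dnorm(x[(p + 1):n], mean = 0, sd = sqrt(s2), log = TRUE))
}

# Simulate an ARCH(1) series with illustrative parameter values
set.seed(3)
n <- 500; alpha <- 1; beta <- 0.5
x <- numeric(n)
x[1] <- rnorm(1, sd = sqrt(alpha))
for (t in 2:n) x[t] <- sqrt(alpha + beta * x[t - 1]^2) * rnorm(1)
ll <- arch_loglik(x, alpha, beta)
```

The likelihood is available in closed form here precisely because each $\sigma_t^2$ is a deterministic function of past observations, in contrast with the stochastic volatility model.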

State-space models are special cases of hidden Markov models (detailed below in Sect. 7.5) in the sense that (7.13) and (7.14) are a special occurrence of the generic representation

$$x_t = G(y_t, \epsilon_t) ,$$
$$y_t = F(y_{t-1}, \zeta_t) . \qquad (7.18)$$

Note, however, that it is not necessarily appealing to resort to this hidden Markov representation, in comparison with state-space models, because the complexity of the functions F or G may hinder the processing of this representation to unbearable levels (while, for state-space models, the linearity of the relations always allows for a generic if not necessarily efficient processing based on, e.g., Gibbs sampling steps).

Stochastic volatility models are quite popular in financial applications, especially in describing series with sudden and correlated changes in the magnitude of variation of the observed values. These models use a hidden chain $(y_t)_{t \in \mathbb{N}}$, called the stochastic volatility, to model the variance of the observables $(x_t)_{t \in \mathbb{N}}$ in the following way: Let $y_0 \sim \mathcal{N}(0, \sigma^2)$ and, for $t = 1, \ldots, T$, define

$$y_t = \varphi y_{t-1} + \sigma \epsilon^*_{t-1} ,$$
$$x_t = \beta e^{y_t/2} \epsilon_t , \qquad (7.19)$$

where both $\epsilon_t$ and $\epsilon^*_t$ are iid $\mathcal{N}(0, 1)$ random variables. In this simple version, the observable is thus a white noise, except that the variance of this noise enjoys a particular AR(1) structure on the logarithmic scale. Quite obviously, this structure makes the computation of the (observed) likelihood a formidable challenge!

15 The acronym ARCH stands for autoregressive conditional heteroscedasticity, heteroscedasticity being a term favored by econometricians to describe heterogeneous variances. Gourieroux (1996) provides a general reference on these models, as well as classical inferential methods of estimation.
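Although the likelihood is intractable, simulating from a stochastic volatility model is immediate. The R sketch below assumes the standard form of the model, with an AR(1) log-volatility $y_t$ and observations $x_t = \beta e^{y_t/2} \epsilon_t$; the parameter values are purely illustrative.

```r
# Sketch: simulating a stochastic volatility series (illustrative values).
set.seed(4)
n <- 1000; phi <- 0.95; sigma <- 0.3; beta <- 1

y <- numeric(n + 1)
y[1] <- rnorm(1, sd = sigma)                   # y_0 ~ N(0, sigma^2)
for (t in 1:n)
  y[t + 1] <- phi * y[t] + sigma * rnorm(1)    # hidden AR(1) log-volatility

# Observations: white noise whose standard deviation is beta * exp(y_t / 2)
x <- beta * exp(y[-1] / 2) * rnorm(n)

# plot(x, type = "l") would typically show clusters of large and small variation
```

The two loops make the source of the difficulty visible: the observed likelihood requires integrating out all n + 1 hidden values of y.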

Figure 7.6 gives the sequence $\{\log(x_t) - \log(x_{t-1})\}$ when $(x_t)$ is the Aegon stock sequence plotted in Fig. 7.1. While this real-life sequence is not necessarily a stochastic volatility process, it presents some features that are common with those processes, including an overall stationary structure and periods in the magnitude of the variation of the sequence.

Fig. 7.6. Dataset Eurostoxx50: First-order difference $\{\log(x_t) - \log(x_{t-1})\}$ of the Aegon stock sequence regarded as a potential stochastic volatility process (7.19)

When comparing ARMA with the hidden Markov models of the following section, it may appear that the former are more general in the sense that they allow a different dependence on the past values. Resorting to the state-space representation (7.18) shows that this is not the case. Different horizons p of dependence can also be included for hidden Markov models simply by (a) using a vector $\tilde x_t = (x_{t-p+1}, \ldots, x_t)$ for the observables or by (b) using a vector $\tilde y_t = (y_{t-q+1}, \ldots, y_t)$ for the latent process in (7.18).

7.5  Hidden  Markov  Models 

Hidden Markov models are a generalization of the mixture models of Chap. 6. Their appeal within this chapter is that they constitute an interesting case of non-Markov time series, besides being extremely useful in modeling, e.g., for financial, telecommunication, and genetic data. We refer the reader to MacDonald and Zucchini (1997) for a deeper introduction to these models and to Cappé et al. (2004) and Frühwirth-Schnatter (2006) for a complete coverage of their statistical processing.


7.5.1  Basics 


The family of hidden Markov models (abbreviated to HMM) consists of a bivariate process $(x_t, y_t)_{t \in \mathbb{N}}$, where the unobserved subprocess $(y_t)_{t \in \mathbb{N}}$ is a homogeneous Markov chain on a state space $\mathcal{Y}$ and, conditional on $(y_t)_{t \in \mathbb{N}}$, $(x_t)_{t \in \mathbb{N}}$ is a series of random variables on $\mathcal{X}$ such that the conditional distribution of $x_t$ given $y_t$ and the past $(x_j, y_j)_{j<t}$ only depends on $y_t$, as represented by the DAG in Fig. 7.7. When $\mathcal{Y} = \{1, \ldots, \kappa\}$, i.e. when the hidden Markov chain takes a finite number of possible values, we have, in particular,

$$x_t | y_t \sim f(x | \xi_{y_t}) ,$$

where $(y_t)_{t \in \mathbb{N}}$ thus is a finite state-space Markov chain, meaning that $y_t | y_{t-1}$ is distributed from

$$P(y_t = i | y_{t-1} = j) = p_{ji} , \qquad 1 \le i \le \kappa ,$$

and the $\xi_i$'s are the different parameters indexing the conditional distribution. In the general case, the joint distribution of $(x_t, y_t)$ given the past values $x_{0:(t-1)} = (x_0, \ldots, x_{t-1})$ and $y_{0:(t-1)} = (y_0, \ldots, y_{t-1})$ factorizes as

$$(x_t, y_t) | x_{0:(t-1)}, y_{0:(t-1)} \sim f(y_t | y_{t-1})\, f(x_t | y_t) ,$$

in agreement with Fig. 7.7. The process $(y_t)_{t \in \mathbb{N}}$ is usually referred to as the state of the model and, again, is not observable (hence, hidden). Inference thus has to be carried out only in terms of the observable process $(x_t)_{t \in \mathbb{N}}$.


Fig. 7.7. Directed acyclic graph (DAG) representation of the dependence structure of a hidden Markov model, where $(x_t)_{t \in \mathbb{N}}$ is the observable process and $(y_t)_{t \in \mathbb{N}}$ the hidden process


Simulating a hidden Markov chain is then straightforward: we start with the simulation of the hidden layer, i.e. of the process $(y_t)_{t=1,\ldots,T}$, and proceed to simulating each $x_t$ conditional on the corresponding $y_t$ ($t = 1, \ldots, T$). The corresponding computing time is linear in T (Exercise 7.18).

Fig. 7.8. Dataset Dnadataset: Sequence of 9718 amine bases for an HIV genome. The four bases A, C, G, and T have been recoded as 1, ..., 4
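The two-stage simulation just described can be sketched in R as follows, for a finite hidden state space and a finite observation space. The function sim_hmm and the numerical values of the transition matrix P and emission matrix Q are hypothetical choices for illustration (they are not estimates from Dnadataset); the sketch assumes n ≥ 2.

```r
# Sketch: simulating a finite-state HMM, hidden layer first, observations second.
sim_hmm <- function(n, P, Q, init) {
  kappa <- nrow(P)
  y <- integer(n); x <- integer(n)
  y[1] <- sample(kappa, 1, prob = init)                      # initial hidden state
  for (t in 2:n) y[t] <- sample(kappa, 1, prob = P[y[t - 1], ])  # Markov transitions
  for (t in 1:n) x[t] <- sample(ncol(Q), 1, prob = Q[y[t], ])    # emissions given y_t
  list(x = x, y = y)
}

set.seed(5)
P <- matrix(c(0.7, 0.3,
              0.4, 0.6), 2, byrow = TRUE)        # hidden transition matrix
Q <- matrix(c(0.38, 0.03, 0.40, 0.19,
              0.31, 0.40, 0.02, 0.27), 2, byrow = TRUE)  # one row per hidden state
sim <- sim_hmm(2000, P, Q, init = c(0.5, 0.5))
```

Each time index is visited a fixed number of times, which is the linear cost in T mentioned above.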

Hidden Markov models have been used in genetics since the early 1990s for the modeling of DNA sequences. In short (and with no ambition at completeness!), DNA, which stands for deoxyribonucleic acid, is a molecule that carries the genetic information about a living organism and is replicated in each of its cells. This molecule is made up of a sequence of amine bases (adenine, cytosine, guanine, and thymine) abbreviated as A, C, G, and T. The particular arrangement of bases in different parts of the sequence is thought to be related to different characteristics of the living organism to which it corresponds. Dnadataset is a particular sequence corresponding to a complete HIV (which stands for Human Immunodeficiency Virus) genome where A, C, G, and T have been recoded as 1, ..., 4. Figure 7.8 represents this sequence of 9718 bases by decomposing it into five blocks. The simplest modeling of this sequence is to assume a two-state hidden Markov model with $\mathcal{Y} = \{1, 2\}$ and $\mathcal{X} = \{1, 2, 3, 4\}$, the assumption being that one hidden state corresponds to noncoding regions and the other hidden state to coding regions.


For statistical purposes, the distributions of both $x_t$ and $y_t$ are usually parameterized, that is, (7.18) looks like

$$x_t = G(y_t, \epsilon_t | \theta) ,$$
$$y_t = F(y_{t-1}, \zeta_t | \delta) , \qquad (7.20)$$

where $\epsilon_t$ and $\zeta_t$ are independent perturbations (white noise) and where $\theta$ and $\delta$ are finite-dimensional parameters.

To draw inference on either the parameters of the HMM or on the hidden chain, it is generally necessary to take advantage of the missing-variable nature of HMMs and to use simultaneous simulation both of $(y_t)_{t \in \mathbb{N}}$ and of the parameters of the model. There is, however, one exception to that requirement, which is revealed in Sect. 7.5.2, and that is when the state space $\mathcal{Y}$ of the hidden chain $(y_t)_{t \in \mathbb{N}}$ is finite.

There obviously is no reason why the data should fit this formalized model.

In the event that both the hidden and the observed chains are on finite state-spaces, with $\mathcal{Y} = \{1, \ldots, \kappa\}$ and $\mathcal{X} = \{1, \ldots, k\}$ as in Dnadataset, the parameter $\theta$ is made up of $\kappa$ probability vectors

$$q^1 = (q^1_1, \ldots, q^1_k), \ \ldots, \ q^\kappa = (q^\kappa_1, \ldots, q^\kappa_k)$$

and the parameter $\delta$ is the $\kappa \times \kappa$ Markov transition matrix $P = (p_{ij})$ on $\mathcal{Y}$. Given that the joint distribution of $(x_t, y_t)_{0 \le t \le T}$ is

$$\varrho_{y_0}\, q^{y_0}_{x_0} \prod_{t=1}^{T} \left\{ p_{y_{t-1} y_t}\, q^{y_t}_{x_t} \right\} ,$$

where $\varrho = (\varrho_1, \ldots, \varrho_\kappa)$ is the stationary distribution of P (i.e., such that $\varrho P = \varrho$), the posterior distribution of $(\theta, \delta)$ given $(x_t, y_t)_t$ factorizes as

$$\pi(\theta, \delta)\, \varrho_{y_0} \prod_{i=1}^{\kappa} \prod_{j=1}^{k} (q^i_j)^{m_{ij}} \times \prod_{i=1}^{\kappa} \prod_{j=1}^{\kappa} p_{ij}^{n_{ij}} ,$$

where the $m_{ij}$'s and the $n_{ij}$'s are sufficient statistics representing

the number of visits to state j by the $x_t$'s when the corresponding $y_t$'s are equal to i
and
the number of transitions from state i to state j on the hidden chain $(y_t)_{t \in \mathbb{N}}$,

respectively. If we condition on the starting value $y_0$, set equal to 1 for (partial) identifiability reasons, and thus omit $\varrho_{y_0}$ in the expression above and if we use a flat prior on the $p_{ij}$'s and $q^i_j$'s, the posterior distributions are Dirichlet. Similarly to the ARMA case processed earlier in this chapter, if we include the starting values in the posterior distribution, this introduces a non-conjugate structure in the simulation of the $p_{ij}$'s, but this can be handled with a Metropolis-Hastings substitute that uses the Dirichlet distribution as the proposal. Note that, in the non-conditional case, we need to simulate $y_0$.

Conditional on the parameters, the simulation of the chain $(y_t)_{0 \le t \le T}$ can be processed Gibbs-wise (i.e., one term at a time), using the fully conditional distributions

$$P(y_t = i | x_t, y_{t-1}, y_{t+1}) \propto p_{y_{t-1} i}\, p_{i y_{t+1}}\, q^i_{x_t} .$$


Therefore, the overall algorithm looks as follows:

Algorithm 7.15 Finite-State HMM Gibbs Sampler

Initialization:

1. Generate random values (or pick arbitrary estimators) of the $p_{ij}$'s and the $q^i_j$'s.

2. Generate the hidden Markov chain $(y_t)_{0 \le t \le T}$ by (i = 1, 2)

$$P(y_t = i) \propto \begin{cases} p_{ii}\, q^i_{x_0} & \text{if } t = 0 , \\ p_{y_{t-1} i}\, q^i_{x_t} & \text{if } t > 0 , \end{cases}$$

and compute the corresponding sufficient statistics.

Iteration m (m ≥ 1):

1. Generate

$$(p_{i1}, \ldots, p_{i\kappa}) \sim \mathcal{D}(1 + n_{i1}, \ldots, 1 + n_{i\kappa}) ,$$
$$(q^i_1, \ldots, q^i_k) \sim \mathcal{D}(1 + m_{i1}, \ldots, 1 + m_{ik}) ,$$

and correct for the missing initial probability by a Metropolis-Hastings step with acceptance probability $\varrho'_{y_0} / \varrho_{y_0}$.

2. Generate successively each $y_t$ ($0 \le t \le T$) by

$$P(y_t = i | x_t, y_{t-1}, y_{t+1}) \propto \begin{cases} p_{ii}\, q^i_{x_0}\, p_{i y_1} & \text{if } t = 0 , \\ p_{y_{t-1} i}\, q^i_{x_t}\, p_{i y_{t+1}} & \text{if } t > 0 , \end{cases}$$

and compute the corresponding sufficient statistics.


In the initialization step of Algorithm 7.15, any distribution on $(y_t)_{t \in \mathbb{N}}$ is obviously valid, but this particular choice is of interest since it is related to the true conditional distribution, simply omitting the dependence on the next value.

The main loop in the Gibbs sampler is then of the form (for $\kappa = 2$ and $k = 4$ as in Dnadataset)

# Note: in this code, x holds the hidden states and y the observed bases
# Beta/Dirichlet simulations for P
a=1/(1+rgamma(1,nab+1)/rgamma(1,naa+1))
b=1/(1+rgamma(1,nba+1)/rgamma(1,nbb+1))
# byrow=TRUE rather than byrow=T, since T is used for the series length
P=matrix(c(a,1-a,1-b,b),ncol=2,byrow=TRUE)
q1=rgamma(4,ma+1)  # and Q
q2=rgamma(4,mb+1)
q1=q1/sum(q1); q2=q2/sum(q2)

# (hidden) Markov conditioning
x[1]=sample(1:2,1,prob=c(a*P[1,x[2]]*q1[y[1]],b*P[2,x[2]]*q2[y[1]]))
for (m in 2:(T-1))
  x[m]=sample(1:2,1,prob=c(P[x[m-1],1]*P[1,x[m+1]]*q1[y[m]],
    P[x[m-1],2]*P[2,x[m+1]]*q2[y[m]]))
x[T]=sample(1:2,1,prob=c(P[x[T-1],1]*q1[y[T]],P[x[T-1],2]*q2[y[T]]))

# Sufficient statistics for next iteration
naa=sum((x[1:(T-1)]==1)*(x[2:T]==1))
nab=sum((x[1:(T-1)]==1)*(x[2:T]==2))
nba=sum((x[1:(T-1)]==2)*(x[2:T]==1))
nbb=sum((x[1:(T-1)]==2)*(x[2:T]==2))
ya=y[x==1]
ma=c(sum(ya==1),sum(ya==2),sum(ya==3),sum(ya==4))
yb=y[x==2]
mb=c(sum(yb==1),sum(yb==2),sum(yb==3),sum(yb==4))
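For a larger number of hidden states, writing out each count by hand becomes tedious. As a possible alternative, the sufficient statistics can be tabulated with table(); the sketch below follows the naming of the listing above (x for the hidden states, y for the observed bases), and the helper name hmm_counts is an assumption made for this illustration.

```r
# Sketch: transition and emission counts for any number of hidden states.
hmm_counts <- function(x, y, kappa = 2, k = 4) {
  n <- length(x)
  # trans[i, j]: number of hidden transitions from state i to state j
  trans <- table(factor(x[-n], levels = 1:kappa),
                 factor(x[-1], levels = 1:kappa))
  # emis[i, j]: number of times base j is observed while the hidden state is i
  emis <- table(factor(x, levels = 1:kappa),
                factor(y, levels = 1:k))
  list(trans = unclass(trans), emis = unclass(emis))
}

x <- c(1, 1, 2, 2, 1)   # toy hidden path
y <- c(1, 2, 3, 4, 1)   # toy observed sequence
s <- hmm_counts(x, y)
```

Using factor() with explicit levels keeps the zero counts, so the rows of s$trans and s$emis can be passed directly (plus one) as Dirichlet parameters.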

We ran several Gibbs samplers for 1,000 iterations, starting from small, medium and high values for $p_{11}$ and $p_{22}$, and got very similar results in the first two and the last two cases for the approximations to the Bayes posterior means, as shown by Table 7.1. The raw output also gives a sense of stability, as shown by Fig. 7.9.

For the third case, started at small values of both $p_{11}$ and $p_{22}$, the simulated chain had not visited the same region of the posterior distribution after those 1,000 iterations, and it produced an estimate with a smaller log-likelihood17 value of -13,160. However, running the Gibbs sampler longer (for 4,000 more iterations) did produce a similar estimate, as shown by the third replication in Table 7.1. This phenomenon is slightly related to the phenomenon, discussed in the context of Figs. 6.4 and 6.3, that the Gibbs sampler tends to "stick" to lower modes for lack of sufficient energy. In the current situation, the energy required to leave the lower mode appears to be available. Note that we have reordered the output to compensate for a possible switch between hidden states 1 and 2 among experiments. This is quite natural, given the lack of identifiability of the hidden states (Exercise 7.17). Flipping the indices 1 and 2 does not modify the likelihood, and thus all these experiments explore the same mode of the posterior.


7.5.2  Forward-Backward  Representation

Fig. 7.9. Dataset Dnadataset: Convergence of a Gibbs sequence to the region of interest on the posterior surface for the hidden Markov model (this is replication 2 in Table 7.1). The row-wise order of the parameters is the same as in Table 7.1

Table 7.1. Dataset Dnadataset: Five runs of the Gibbs sampling approximations to the Bayes estimates of the parameters for the hidden Markov model along with final log-likelihood (starting values are indicated on the line below in parentheses) based on M = 1000 iterations (except for replication 3, based on 5,000 iterations)

Run   p11      p22      q^1_1    q^1_2    q^1_3    q^2_1    q^2_2    q^2_3     Log-like
1     0.720    0.581    0.381    0.032    0.396    0.306    0.406    0.018     -13,121
      (0.844)  (0.885)  (0.260)  (0.281)  (0.279)  (0.087)  (0.094)  (0.0937)
2     0.662    0.620    0.374    0.016    0.423    0.317    0.381    0.034     -13,123
      (0.628)  (0.621)  (0.203)  (0.352)  (0.199)  (0.066)  (0.114)  (0.0645)
3     0.696    0.609    0.376    0.023    0.401    0.318    0.389    0.030     -13,118
      (0.055)  (0.150)  (0.293)  (0.200)  (0.232)  (0.150)  (0.102)  (0.119)
4     0.704    0.580    0.377    0.024    0.407    0.313    0.403    0.020     -13,121
      (0.915)  (0.610)  (0.237)  (0.219)  (0.228)  (0.079)  (0.073)  (0.076)
5     0.694    0.585    0.376    0.0218   0.410    0.315    0.395    0.0245    -13,119
      (0.600)  (0.516)  (0.296)  (0.255)  (0.288)  (0.110)  (0.095)  (0.107)

17 The log-posterior is proportional to the log-likelihood in that special case, and the log-likelihood is computed using a technique described below in Sect. 7.5.2.

When the state space of the hidden Markov chain $\mathcal{Y}$ is finite, that is, when

$$\mathcal{Y} = \{1, \ldots, \kappa\} ,$$

the likelihood function18 of the observed process $(x_t)_{1 \le t \le T}$ can be computed in a manageable $O(T \times \kappa^2)$ time by a recurrence relation called the forward-backward or Baum-Welch formulas.19 We now explain how those formulas are derived.

18 To lighten notation, we will not use the parameters appearing in the various distributions of the HMM, even though they are obviously of central interest.

As illustrated in Fig. 7.7, a generic feature of HMMs is that ($t = 2, \ldots, T$)

$$p(y_t | y_{t-1}, x_{0:T}) = p(y_t | y_{t-1}, x_{t:T}) .$$

In other words, knowledge of the past observations is redundant for the distribution of the hidden Markov chain when we condition on its previous value. Therefore, when $\mathcal{Y}$ is finite, we can write that

$$p(y_T | y_{T-1}, x_{0:T}) \propto p_{y_{T-1} y_T}\, f(x_T | y_T) \equiv p^*_T(y_T | y_{T-1}, x_{0:T}) ,$$

meaning that we define $p^*_T(y_T | y_{T-1}, x_{0:T})$ as the unnormalized version of the density $p(y_T | y_{T-1}, x_{0:T})$. Then we can process backward the definition of the previous conditionals, so that ($1 \le t < T$)

$$\begin{aligned}
p(y_t | y_{t-1}, x_{0:T}) &= \sum_{i=1}^{\kappa} p(y_t, y_{t+1} = i | y_{t-1}, x_{t:T}) \\
&\propto \sum_{i=1}^{\kappa} p(y_t, y_{t+1} = i, x_{t:T} | y_{t-1}) \\
&= \sum_{i=1}^{\kappa} p(y_t | y_{t-1})\, f(x_t | y_t)\, p(y_{t+1} = i, x_{(t+1):T} | y_t) \\
&\propto p_{y_{t-1} y_t}\, f(x_t | y_t) \sum_{i=1}^{\kappa} p(y_{t+1} = i | y_t, x_{(t+1):T}) \\
&\propto p_{y_{t-1} y_t}\, f(x_t | y_t) \sum_{i=1}^{\kappa} p^*_{t+1}(i | y_t, x_{0:T}) \equiv p^*_t(y_t | y_{t-1}, x_{0:T}) .
\end{aligned}$$

Finally, the conditional distribution of the first hidden value $y_0$ is

$$p(y_0 | x_{0:T}) \propto \varrho_{y_0}\, f(x_0 | y_0) \sum_{i=1}^{\kappa} p^*_1(i | y_0, x_{0:T}) \equiv p^*_0(y_0 | x_{0:T}) ,$$

where $(\varrho_k)_k$ is the stationary distribution associated with the Markov transition matrix P. (This is unless the first hidden value $y_0$ is automatically set equal to 1 for identifiability reasons.)

While this construction amounts to a straightforward conditioning argument, the use of the unnormalized functions $p^*_{t+1}(y_{t+1} = i | y_t, x_{0:T})$ is crucial for deriving the joint conditional distribution of $y_{0:T}$ since resorting to the normalized conditionals instead would result in a useless identity.

19 This recurrence relation has been known for quite a while in the signal processing literature and is also used in the corresponding EM algorithm; see Cappé et al. (2004) for details.

Notice that, as stated above, the derivation of the $p^*_t$'s indeed has a cost of $O(T \times \kappa^2)$ since, for each t and each of the $\kappa$ values of $y_t$, a sum of $\kappa$ terms has to be computed. So, in terms of raw computational time, computing the observed likelihood does not take less time than simulating the sequence $(y_t)_{t \in \mathbb{N}}$ in the Gibbs sampler. However, the gain in using this forward-backward formula may impact in subtler ways a resulting Metropolis-Hastings algorithm, such as a better mixing of the chain of the parameters, given that we are simulating the whole vector at once.

Once we have all the conditioning functions (or backward equations), it is possible to simulate sequentially the hidden sequence $y_{0:T}$ given $x_{0:T}$ by generating first $y_0$ from $p(y_0 | x_{0:T})$, second $y_1$ from $p(y_1 | y_0, x_{0:T})$ and so on. However, there is (much) more to be done. Indeed, when considering the joint conditional distribution of $y_{0:T}$ given $x_{0:T}$, we have

$$\begin{aligned}
p(y_{0:T} | x_{0:T}) &= p(y_0 | x_{0:T}) \prod_{t=1}^{T} p(y_t | y_{t-1}, x_{0:T}) \\
&= \frac{\varrho_{y_0}\, f(x_0 | y_0) \sum_{i=1}^{\kappa} p^*_1(i | y_0, x_{0:T})}{\sum_{i=1}^{\kappa} p^*_0(i | x_{0:T})} \prod_{t=1}^{T} \frac{p_{y_{t-1} y_t}\, f(x_t | y_t) \sum_{i=1}^{\kappa} p^*_{t+1}(i | y_t, x_{0:T})}{\sum_{i=1}^{\kappa} p^*_t(i | y_{t-1}, x_{0:T})} \\
&= \varrho_{y_0}\, f(x_0 | y_0) \prod_{t=1}^{T} p_{y_{t-1} y_t}\, f(x_t | y_t) \Big/ \sum_{i=1}^{\kappa} p^*_0(i | x_{0:T})
\end{aligned}$$

since all the other sums cancel (with the convention that $\sum_{i} p^*_{T+1}(i | y_T, x_{0:T}) = 1$). This joint conditional distribution immediately leads to the derivation of the observed likelihood since, by Bayes' formula,

$$f(x_{0:T}) = \frac{f(x_{0:T} | y_{0:T})\, p(y_{0:T})}{p(y_{0:T} | x_{0:T})} = \sum_{i=1}^{\kappa} p^*_0(i | x_{0:T}) ,$$

which is the normalizing constant of the initial conditional distribution! Therefore, working with the unnormalized densities has the supplementary advantage of providing the observed likelihood as a by-product. (Keep in mind that all the expressions above implicitly depend on the model parameters.)

A forward derivation of the likelihood can similarly be constructed. Besides the obvious construction that is symmetrical to the previous one, consider the so-called prediction filter

$$
\varphi_t(i) = P(y_t = i | x_{1:t-1}),
$$

with $\varphi_1(j) = \pi(j)$ (where the term prediction refers to the conditioning on the observations prior to time $t$). The forward equations are then given by

$$
\varphi_{t+1}(j) = \frac{1}{c_t} \sum_{i=1}^{\kappa} f(x_t|y_t=i)\, \varphi_t(i)\, p_{ij},
$$

where

$$
c_t = \sum_{k=1}^{\kappa} f(x_t|y_t=k)\, \varphi_t(k)
$$

is the normalizing constant. (This formula uses exactly the same principle as the backward equations.) Exploiting the Markov nature of the joint process, we can then derive the log-likelihood as


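$$
\log p(x_{1:t}) = \sum_{r=1}^{t} \log\left[\sum_{i=1}^{\kappa} p(x_r, y_r = i | x_{1:(r-1)})\right] = \sum_{r=1}^{t} \log\left[\sum_{i=1}^{\kappa} f(x_r|y_r = i)\, \varphi_r(i)\right],
$$

which also requires an $O(T \times \kappa^2)$ computational time.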

The resulting R function for computing the (observed) likelihood is therefore (for $\kappa = 2$ and $k = 4$ as in Dnadataset)

likej=function(vec,log=TRUE){
  # vec is the aggregated parameter vector
  # the observed sequence y and its length T are taken from the workspace
  P=matrix(c(vec[1],1-vec[1],1-vec[2],vec[2]),ncol=2,byrow=TRUE)
  Q1=vec[3:6]; Q2=vec[7:10]
  # stationary distribution of P, proportional to (1-p22,1-p11)
  pxy=c(P[2,1],P[1,2])
  pxy=pxy/sum(pxy)
  pyy=rep(1,T)
  pyy[1]=pxy[1]*Q1[y[1]]+pxy[2]*Q2[y[1]]
  for (t in 2:T){
    # prediction filter: distribution of the hidden state at time t
    pxy=pxy[1]*Q1[y[t-1]]*P[1,]+pxy[2]*Q2[y[t-1]]*P[2,]
    pxy=pxy/sum(pxy)
    # conditional probability of y[t] given the past observations
    pyy[t]=(pxy[1]*Q1[y[t]]+pxy[2]*Q2[y[t]])
  }
  if (log){
    ute=sum(log(pyy))
  }else{
    ute=prod(pyy)
  }
  ute
}
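As a cross-check on the forward recursion, the following sketch (not the code of the book) extends the prediction filter to an arbitrary number of hidden states and compares it with a brute-force sum over all hidden paths on a very short sequence. The names hmm_loglik and hmm_brute, their arguments, and the numerical values are illustrative assumptions: obs stands for the observed sequence, P for the hidden transition matrix, Q for the matrix of emission probabilities, and init for the distribution of the first hidden state.

```r
# Prediction-filter log-likelihood for a finite HMM (illustrative sketch).
# obs  : observed sequence with values in 1..k
# P    : kappa x kappa transition matrix of the hidden chain
# Q    : kappa x k matrix, Q[i, x] = probability of observing x in state i
# init : distribution of the first hidden state
hmm_loglik <- function(obs, P, Q, init) {
  phi <- init                              # phi_1(i)
  ll <- 0
  for (t in seq_along(obs)) {
    joint <- phi * Q[, obs[t]]             # f(x_t | state i) * phi_t(i)
    ct <- sum(joint)                       # normalizing constant c_t
    ll <- ll + log(ct)
    phi <- as.vector((joint / ct) %*% P)   # phi_{t+1}(j)
  }
  ll
}

# Brute-force log-likelihood: sum over all kappa^T hidden paths (tiny T only).
hmm_brute <- function(obs, P, Q, init) {
  kappa <- nrow(P); n <- length(obs)
  paths <- unname(as.matrix(expand.grid(rep(list(seq_len(kappa)), n))))
  total <- 0
  for (r in seq_len(nrow(paths))) {
    s <- paths[r, ]
    pr <- init[s[1]] * Q[s[1], obs[1]]
    if (n > 1) for (t in 2:n) pr <- pr * P[s[t - 1], s[t]] * Q[s[t], obs[t]]
    total <- total + pr
  }
  log(total)
}

# Hypothetical two-state, four-symbol example
P <- matrix(c(0.7, 0.3, 0.4, 0.6), ncol = 2, byrow = TRUE)
Q <- rbind(c(0.4, 0.1, 0.3, 0.2), c(0.1, 0.4, 0.2, 0.3))
init <- c(P[2, 1], P[1, 2]) / (P[2, 1] + P[1, 2])  # stationary distribution of P
obs <- c(1, 3, 2, 4, 4, 1)
c(filter = hmm_loglik(obs, P, Q, init), brute = hmm_brute(obs, P, Q, init))
```

The brute-force version costs $\kappa^T$ operations against $O(T \times \kappa^2)$ for the filter, so it is only usable as a check on toy sequences.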

Obviously, being able to handle the observed likelihood directly when $T$ is reasonable opens new avenues for simulation methods. For instance, the completion step (of simulating the hidden Markov chain) is no longer necessary, and Metropolis-Hastings alternatives such as random-walk proposals can be used.


Returning to Dnadataset, we can compute the log-likelihood (and hence the posterior up to a normalizing constant) associated with a given parameter using, for instance, the prediction filter. In that case,

$$
\log p(x_{1:T}) = \sum_{t=1}^{T} \log\left[\sum_{i=1}^{\kappa} q^i_{x_t}\, \varphi_t(i)\right],
$$

where $\varphi_t(j) \propto \sum_{i=1}^{\kappa} q^i_{x_{t-1}}\, \varphi_{t-1}(i)\, p_{ij}$. This representation of the log-likelihood is used in the computation given above for the Gibbs sampler.

Furthermore, given that all parameters to be simulated are probabilities, using a normal random walk proposal in the Metropolis-Hastings algorithm is not adequate. Instead, a more appropriate proposal is based on Dirichlet distributions centered at the current value, with scale factor $\alpha > 0$; that is ($j = 1,2$),

$$
\tilde p_{jj} \sim \mathcal{B}e(\alpha p_{jj}, \alpha(1-p_{jj})), \qquad \tilde q^{j} \sim \mathcal{D}(\alpha q^j_1, \ldots, \alpha q^j_4).
$$

The Metropolis-Hastings acceptance probability is then based on the ratio of the likelihoods multiplied by the ratio of the proposal densities, $f(\theta|\theta')/f(\theta'|\theta)$, where $\theta'$ denotes the proposed value. Since larger values of $\alpha$ produce more local moves, we could test a range of values to determine the "proper" scale. However, this requires a long calibration step. Instead, the algorithm can take advantage of the different scales by picking at random for each iteration a value of $\alpha$ from among 1, 10, 100, 10,000 or 100,000. (The randomness in $\alpha$ can then be either ignored in the computation of the proposal density $f$ or integrated by a Rao-Blackwell argument.) For Dnadataset, this range of $\alpha$'s was wide enough since the average probability of acceptance is 0.25 and a chain $(\theta^{(m)})_m$ started at random does converge to the same values as the Gibbs chains simulated above, as shown by Fig. 7.10, which also indicates that more iterations would be necessary to achieve complete stability. We can note in particular that the maximum log-posterior value found along the iterations of the Metropolis-Hastings algorithm is $-13{,}116$, which is larger than the values found in Table 7.1 for the Gibbs sampler, for parameter values of (0.70, 0.58, 0.37, 0.011, 0.42, 0.19, 0.32, 0.42, 0.003, 0.26).
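To make the Dirichlet proposal concrete, here is a minimal sketch of one such Metropolis-Hastings move on a single probability vector, with the scale $\alpha$ drawn at random at each iteration. It is not the code used for Dnadataset: the helper names rdirich, ldirich and mh_dirichlet_step are assumptions, and the target is a toy multinomial posterior under a flat prior (counts is hypothetical) rather than the hidden Markov likelihood. The beta proposal on $p_{jj}$ is the special case of a Dirichlet proposal on the pair $(p_{jj}, 1-p_{jj})$.

```r
# Dirichlet draw via normalized gamma variables (base R has no Dirichlet sampler).
rdirich <- function(a) { g <- rgamma(length(a), shape = a); g / sum(g) }
# Log-density of a Dirichlet(a) distribution at the probability vector p.
ldirich <- function(p, a) lgamma(sum(a)) - sum(lgamma(a)) + sum((a - 1) * log(p))

# One Metropolis-Hastings move on the probability vector q, with proposal
# Dirichlet(alpha * q) centered at the current value. logtarget is any
# function returning the log-posterior up to an additive constant.
mh_dirichlet_step <- function(q, logtarget, alpha) {
  prop <- rdirich(alpha * q)
  logratio <- logtarget(prop) - logtarget(q) +
    ldirich(q, alpha * prop) - ldirich(prop, alpha * q)
  # non-finite ratios (e.g., a proposed component underflowing to 0) are rejected
  if (is.finite(logratio) && log(runif(1)) < logratio) prop else q
}

# Toy target: multinomial counts with a flat prior (hypothetical data)
set.seed(1)
counts <- c(40, 10, 30, 20)
logtarget <- function(p) sum(counts * log(p))
q <- rep(0.25, 4)
chain <- matrix(NA, 5000, 4)
for (m in 1:5000) {
  alpha <- sample(c(1, 10, 100, 1e4, 1e5), 1)  # random scale, as in the text
  q <- mh_dirichlet_step(q, logtarget, alpha)
  chain[m, ] <- q
}
colMeans(chain)
```

Because each value of $\alpha$ defines a valid Metropolis-Hastings kernel on its own, using the density of the selected $\alpha$ in the acceptance ratio corresponds to the option of ignoring the randomness in $\alpha$.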


When the state space $\mathcal{Y}$ is finite, it may be of interest to estimate the order of the hidden Markov chain. For instance, in the case of Dnadataset, it is relevant to infer how many hidden coding states there are. A possible approach, not covered here, is to use a reversible jump MCMC algorithm that closely resembles the reversible jump algorithm for the mixture model. The reference in this direction is Cappé et al. (2004, Chap. 16), where the authors construct a reversible jump algorithm in this setting. However, the availability of the (observed) likelihood means that the marginal solution of Chib (1995), presented in Chap. 6 (Sect. 6.8) for mixtures of distributions, also applies in the current setting (Exercise 7.19).


Fig. 7.10. Dataset Dnadataset: Convergence of a Metropolis-Hastings sequence for the hidden Markov model based on 5,000 iterations. The overlaid curve in the background is the sequence of log-posterior values


Fig. 7.11. DAG representation of the dependence structure of a Markov-switching model where $(x_t)_t$ is the observable process and $(y_t)_t$ is the hidden chain


The model first introduced for Dnadataset is overly simplistic in that, at least within the coding regime, the $x_t$'s are not independent. A more realistic modeling thus assumes that the $x_t$'s constitute a Markov chain within each state of the hidden chain, resulting in the dependence graph of Fig. 7.11. To distinguish this case from the earlier one, it is often called Markov-switching. This extension is much more versatile than the model of Fig. 7.7, and we can hope to capture the time dependence better. However, it is far from parsimonious, as the use of a different Markov transition matrix for each hidden state induces an explosion in the number of parameters. For instance, if there are two hidden states, the number of parameters is 26; if there are four hidden states, the number jumps to 60.
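The two counts quoted above can be recovered by a short calculation, sketched here under the assumption that each row of a transition matrix contributes one free parameter fewer than its length. The function name ms_nparam is illustrative and not part of the book.

```r
# Free parameters in a Markov-switching model with kappa hidden states and
# k observable values (k = 4 nucleotides for Dnadataset).
ms_nparam <- function(kappa, k = 4) {
  hidden <- kappa * (kappa - 1)      # kappa rows of the hidden transition matrix
  observed <- kappa * k * (k - 1)    # one k x k transition matrix per hidden state
  hidden + observed
}
sapply(c(2, 4), ms_nparam)
```

With two hidden states this gives 2 + 24 = 26 parameters, and with four hidden states 12 + 48 = 60, in agreement with the figures in the text.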


7.6  Exercises 

7.1 Consider the process $(x_t)_{t\in\mathbb{Z}}$ defined by

$$x_t = a + bt + y_t,$$

where $(y_t)_{t\in\mathbb{Z}}$ is an iid sequence of random variables with mean 0 and variance $\sigma^2$, and where $a$ and $b$ are constants. Define

$$w_t = (2q+1)^{-1} \sum_{j=-q}^{q} x_{t+j}.$$

Compute the mean and the autocovariance function of $(w_t)_{t\in\mathbb{Z}}$. Show that $(w_t)_{t\in\mathbb{Z}}$ is not stationary but that its autocovariance function $\gamma_w(t+h,t)$ does not depend on $t$.

7.2 Suppose that the process $(x_t)_{t\in\mathbb{N}}$ is such that $x_0 \sim \mathcal{N}(0,\tau^2)$ and, for all $t \in \mathbb{N}$,

$$x_{t+1}|x_{0:t} \sim \mathcal{N}(x_t/2, \sigma^2), \qquad \sigma > 0.$$

Give a necessary condition on $\tau^2$ for $(x_t)_{t\in\mathbb{N}}$ to be a (strictly) stationary process.

7.3 Suppose that $(x_t)_{t\in\mathbb{N}}$ is a Gaussian random walk on $\mathbb{R}$: $x_0 \sim \mathcal{N}(0,\tau^2)$ and, for all $t \in \mathbb{N}$,

$$x_{t+1}|x_{0:t} \sim \mathcal{N}(x_t, \sigma^2), \qquad \sigma > 0.$$

Show that, whatever the value of $\tau^2$ is, $(x_t)_{t\in\mathbb{N}}$ is not a (strictly) stationary process.

7.4 Give the necessary and sufficient condition under which an AR(2) process with autoregressive polynomial $\mathcal{P}(u) = 1 - \varrho_1 u - \varrho_2 u^2$ (with $\varrho_2 \neq 0$) is causal.

7.5 Consider the process $(x_t)_{t\in\mathbb{N}}$ such that $x_0 = 0$ and, for all $t \in \mathbb{N}$,

$$x_{t+1}|x_{0:t} \sim \mathcal{N}(\varrho x_t, \sigma^2).$$

Suppose that $\pi(\varrho,\sigma) = 1/\sigma$ and that there is no constraint on $\varrho$. Show that the conditional posterior distribution of $\varrho$, conditional on the observations $x_{0:T}$ and on $\sigma^2$, is a $\mathcal{N}(\mu_T, \omega_T^2)$ distribution with

$$\mu_T = \sum_{t=1}^{T} x_{t-1}x_t \Big/ \sum_{t=1}^{T} x_{t-1}^2 \qquad\text{and}\qquad \omega_T^2 = \sigma^2 \Big/ \sum_{t=1}^{T} x_{t-1}^2.$$

Show that the marginal posterior distribution of $\varrho$ is a Student $\mathcal{T}(T-1, \mu_T, \nu_T^2)$ distribution with
Apply this modeling to the Aegon series in Eurostoxx50 and evaluate its predictive abilities.

7.6 For Algorithm 7.13, show that, if the proposal on $\sigma^2$ is a log-normal distribution $\mathcal{LN}(\log(\sigma^2_{t-1}), \tau^2)$ and if the prior distribution on $\sigma^2$ is the noninformative prior $\pi(\sigma^2) = 1/\sigma^2$, the acceptance ratio also reduces to the likelihood ratio because of the Jacobian.

7.7 Write down the joint distribution of $(y_t,x_t)_{t\in\mathbb{N}}$ in (7.19) and deduce that the (observed) likelihood is not available in closed form.

7.8 Show that the stationary distribution of $x_{-p:-1}$ in an AR($p$) model is a $\mathcal{N}_p(\mu\mathbf{1}_p, \mathbf{A})$ distribution, and give a fixed point equation satisfied by the covariance matrix $\mathbf{A}$.

7.9 Show that the posterior distribution on $\theta$ associated with the prior $\pi(\theta) = 1/\sigma^2$ and an AR($p$) model is well-defined for $T > p$ observations.

7.10 Show that the coefficients of the polynomial $\mathcal{P}$ in (7.5) associated with an AR($p$) model can be derived in $O(p^2)$ time from the inverse roots $\lambda_i$ using the recurrence relations ($i = 1,\ldots,p$, $j = 0,\ldots,p$)

$$\psi^i_0 = 1, \qquad \psi^i_j = \psi^{i-1}_j - \lambda_i \psi^{i-1}_{j-1},$$

where $\psi^0_0 = 1$ and $\psi^i_j = 0$ for $j > i$, and setting $\varrho_j = -\psi^p_j$ ($j = 1,\ldots,p$).

7.11 Given the polynomial $\mathcal{P}$ in (7.5), the fact that all the roots are outside the unit circle can be determined without deriving the roots, thanks to the Schur-Cohn test. If $A_p = \mathcal{P}$, a recursive definition of decreasing degree polynomials is ($k = p,\ldots,1$)

$$u A_{k-1}(u) = A_k(u) - \varphi_k A^\star_k(u),$$

where $A^\star_k$ denotes the reciprocal polynomial $A^\star_k(u) = u^k A_k(1/u)$.

1. Give the expression of $\varphi_k$ in terms of the coefficients of $A_k$.

2. Show that the degree of $A_k$ is at most $k$.

3. If $a_{m,k}$ denotes the $m$-th degree coefficient in $A_k$, show that $a_{k,k} \neq 0$ for $k = 0,\ldots,p$ if, and only if, $a_{0,k} \neq a_{k,k}$ for all $k$'s.

4. Check by simulation that, in cases when $a_{k,k} \neq 0$ for $k = 0,\ldots,p$, the roots are outside the unit circle if, and only if, all the coefficients $a_{k,k}$ are positive.

7.12 For an MA($q$) process, show that ($s \le q$)

$$\gamma_x(s) = \sigma^2 \sum_{i=0}^{q-|s|} \vartheta_i \vartheta_{i+|s|}.$$

7.13 Show that the conditional distribution of $(\epsilon_0,\ldots,\epsilon_{-q+1})$ given both $x_{1:T}$ and the parameters is a normal distribution. Evaluate the complexity of computing the mean and covariance matrix of this distribution.

7.14 Give the conditional distribution of $\epsilon_{-t}$ given the other $\epsilon_{-i}$'s, $x_{1:T}$, and the $\hat\epsilon_i$'s. Show that this distribution only depends on the other $\epsilon_{-i}$'s, $x_{1:q-t+1}$, and $\hat\epsilon_{1:q-t+1}$.

7.15 Show that the (useful) predictive horizon for the MA($q$) model is restricted to the first $q$ future observations $x_{t+i}$.

7.16 Show that the system of equations given by (7.13) and (7.14) induces a Markov chain on the completed variable $(x_t, y_t)$. Deduce that state-space models are special cases of hidden Markov models.

7.17 Show that, for a hidden Markov model, when the support $\mathcal{Y}$ is finite and when $(y_t)_{t\in\mathbb{N}}$ is stationary, the marginal distribution of $x_t$ is the same mixture distribution for all $t$'s. Deduce that the same identifiability problem as in mixture models occurs in this setting.

7.18 Given a hidden Markov chain $(x_t, y_t)$ with both $x_t$ and $y_t$ taking a finite number of possible values, $k$ and $\kappa$, show that the time required for the simulation of $T$ consecutive observations is in $O(k\kappa T)$.

7.19 Implement Chib's method of Sect. 6.8 in the case of a doubly finite hidden Markov chain. First, show that an equivalent to the approximation (6.9) is available for the denominator of (6.8). Second, discuss whether or not the label switching issue also arises in this framework. Third, apply this approximation to Dnadataset.


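7.20 Show that the counterpart of the prediction filter in the Markov-switching case is given by

$$\log p(x_{1:t}) = \sum_{r=1}^{t} \log\left[\sum_{i=1}^{\kappa} f(x_r|x_{r-1}, y_r = i)\, \varphi_r(i)\right],$$

where $\varphi_r(i) = P(y_r = i|x_{1:r-1})$ is given by the recursive formula

$$\varphi_r(i) \propto \sum_{j=1}^{\kappa} p_{ji}\, f(x_{r-1}|x_{r-2}, y_{r-1} = j)\, \varphi_{r-1}(j).$$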
“Reduce it to binary, Siobhan,” she told herself.

— Ian  Rankin,  Resurrection  Men. — 


Roadmap 

This  final  chapter  covers  the  analysis  of  pixelized  images  through  Markov  random 
field  models,  towards  pattern  detection  and  image  correction.  We  start  with  the 
statistical  analysis  of  Markov  random  fields,  which  are  extensions  of  Markov  chains 
to  the  spatial  domain,  as  they  are  instrumental  in  this  chapter.  This  is  also  the 
perfect  opportunity  to  cover  the  ABC  method,  as  these  models  do  not  allow  for 
a  closed  form  likelihood.  Image  analysis  has  been  a  very  active  area  for  both 
Bayesian  statistics  and  computational  methods  in  the  past  30  years,  so  we  feel  it 
well  deserves  a  chapter  of  its  own  for  its  specific  features. 


J.-M. Marin and C.P. Robert, Bayesian Essentials with R, Springer Texts in Statistics, DOI 10.1007/978-1-4614-8687-9_8,
8.1 Image Analysis as a Statistical Problem

If we think of a computer image as a (large) collection of colored pixels disposed on a grid, there does not seem to be any randomness involved nor any need for statistical analysis! Nonetheless, image analysis seen as a statistical analysis is a thriving field that saw the emergence of several major statistical advances, including, for instance, the Gibbs sampler. (Moreover, this field has predominantly adopted a Bayesian perspective both because this was a natural thing to do and because the analytical power of this approach was higher than with other methods.) The reason for this apparent paradox is that, while pixels usually are deterministic objects, the complexity and size of images require one to represent those pixels as the random output of a distribution governed by an object of much smaller dimension. For instance, this is the case in computer vision, where specific objects need to be extracted out of a much richer (or noisier) background.

In this spirit of extracting information from a huge dimensional structure, we thus build in Sect. 8.2 a specific family of distributions inspired from particle physics, the Potts model, in order to structure images and other spatial structures in terms of local homogeneity. Unfortunately, this is a mostly theoretical section with very few illustrations. In Sect. 8.3, we address the fundamental issue of handling the missing normalizing constant in these models by introducing a new computational technique called ABC that operates on intractable likelihoods (with the penalty of producing an approximate answer). In Sect. 8.4, we impose a strong spatial dimension on the prior associated with an image in order to gather homogeneous structures out of a complex or blurry image.


8.2  Spatial  Dependence 

8.2.1  Grids  and  Lattices 

An image (in the sense of a computer generated image) is a special case of a lattice, in the sense that it is a random object whose elements are indexed by the location of the pixels and are therefore related by the geographical proximity of those locations. In full generality, a lattice is a mathematical multidimensional object on which a neighbourhood relation can be defined. Even though the original analysis of lattice models by Besag (1974) focussed on plant ecology and agricultural experiments, the neighbourhood relation is only constrained to be a symmetric relation and it does not necessarily have a connection with a geographical proximity, nor with an image. For instance, the relation can describe social interactions between Amazon tribes or words in a manuscript sharing a linguistic root. (The neighbourhood relation between two points of the lattice is generally translated in statistical terms into a probabilistic dependence between those points.) The lattice associated with an image is a regular $n \times m$ array made of $(i,j)$'s ($1 \le i \le n$, $1 \le j \le m$), whose nearest (but not necessarily only) neighbors are made of the four entries $(i,j-1)$, $(i,j+1)$, $(i-1,j)$ and $(i+1,j)$. In order to properly describe a dependence structure in images or in other spatial objects indexed by a lattice, we need to expand the notion of Markov chain on those structures. Since a lattice is a multidimensional object (as opposed to the unidimensional line corresponding to the times of observation of the Markov chain), a first requirement for the generalization is to define a proper neighbourhood structure.


In  order  to  illustrate  this  notion,  we  consider  a  small  dataset1  depicting 
the  presence  of  tufted  sedges2  in  a  part  of  a  wetland.  This  dataset,  called 
Laichedata,  is  simply  a  25  x  25  matrix  of  zeroes  and  ones.  The  corresponding 
lattice  is  the  25  x  25  array  (Fig.  8.1). 


Fig.  8.1.  Presence/absence  of  the  tufted  sedge  plant  ( Carex  data)  on  a  rectangular 
patch 


Given a lattice $\mathcal{I}$ of sites $i \in \mathcal{I}$ on a map or of pixels in an image,3 a neighbourhood relation on $\mathcal{I}$ is denoted by $i \sim j$, meaning that $i$ and $j$ are neighbors. If we associate a probability distribution on a vector $\mathbf{x}$ indexed by the lattice, $\mathbf{x} = (x_i)_{i\in\mathcal{I}}$, with this relation, meaning that two components $x_i$ and $x_j$ are correlated if the sites $i$ and $j$ are neighbors, a fundamental requirement for the existence of this distribution is that the neighbourhood relation is symmetric (Cressie, 1993): if $i$ is a neighbor of $j$ (written as $i \sim j$), then $j$ is a neighbor of $i$. (By convention, $i$ is not a neighbor of itself.) Figure 8.2 illustrates this notion for three types of neighborhoods on a regular grid. For instance, Laichedata could be associated with a northwest-southeast neighbourhood to account for dominant winds: an entry $(i,j)$ would have as neighbors $(i-1,j-1)$ and $(i+1,j+1)$.

1 Taken from Gaetan and Guyon (2010), kindly provided by the authors.

2 Wikipedia: “Carex is a genus of plants in the family Cyperaceae, commonly known as sedges. Most (but not all) sedges are found in wetlands, where they are often the dominant vegetation.” Laiche is the French for sedge.

3 We will indiscriminately use site and pixel in the remainder of the chapter.
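The symmetry requirement can be checked mechanically on a grid. The sketch below is only an illustration: the function names grid_neighbors and is_symmetric and the encoding of a neighbourhood as a matrix of relative offsets are assumptions, not notation from the book. It lists the neighbors of a site and verifies that the relation is symmetric for the four-neighbor and the northwest-southeast structures, while a one-sided (northwest only) relation fails the check.

```r
# Neighbors of site (i, j) on an n x m grid, for a matrix of relative offsets
# (one row per neighbor direction); sites outside the grid are dropped.
grid_neighbors <- function(i, j, n, m, offsets) {
  cand <- cbind(i + offsets[, 1], j + offsets[, 2])
  keep <- cand[, 1] >= 1 & cand[, 1] <= n & cand[, 2] >= 1 & cand[, 2] <= m
  cand[keep, , drop = FALSE]
}

four <- rbind(c(0, -1), c(0, 1), c(-1, 0), c(1, 0))  # four nearest neighbors
nwse <- rbind(c(-1, -1), c(1, 1))                    # northwest-southeast
nw_only <- rbind(c(-1, -1))                          # one-sided, not symmetric

# Symmetry check: (i, j) must be a neighbor of each of its own neighbors.
is_symmetric <- function(n, m, offsets) {
  for (i in 1:n) for (j in 1:m) {
    nb <- grid_neighbors(i, j, n, m, offsets)
    for (r in seq_len(nrow(nb))) {
      back <- grid_neighbors(nb[r, 1], nb[r, 2], n, m, offsets)
      if (!any(back[, 1] == i & back[, 2] == j)) return(FALSE)
    }
  }
  TRUE
}

sapply(list(four = four, nwse = nwse, nw_only = nw_only),
       function(o) is_symmetric(25, 25, o))
```

Border sites simply have fewer neighbors under this convention, for example two at a corner of the grid for the four-neighbor structure.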




Fig. 8.2. Some common neighbourhood structures used in imaging, with four (upper left), eight (upper right), or twelve neighbors (lower)


8.2.2  Markov  Random  Fields 

A random field on $\mathcal{I}$ is a random structure indexed by the lattice $\mathcal{I}$, a collection of random variables $\{x_i; i \in \mathcal{I}\}$ where each $x_i$ takes values in a finite set $\chi$. Obviously, the interesting case is when the $x_i$'s are dependent random variables in relation with the neighbourhood structure on $\mathcal{I}$.

If $n(i)$ is the set of neighbors of $i \in \mathcal{I}$ and if $\mathbf{x}_A = \{x_i; i \in A\}$ denotes the subset of $\mathbf{x}$ for indices in a subset $A \subset \mathcal{I}$, then $\mathbf{x}_{n(i)}$ is the set of values taken by the neighbors of $i$. The extension from a Markov chain to a Markov random field then assumes only dependence on the neighbors.4 More precisely, if, as before, we denote by $\mathbf{x}_{-A} = \{x_i; i \notin A\}$ the coordinates that are not in a given subset $A \subset \mathcal{I}$, a random field is a Markov random field (MRF) if the conditional distribution of any pixel given the other pixels only depends on the values of the neighbors of that pixel; i.e., for $i \in \mathcal{I}$,

$$\pi(x_i|\mathbf{x}_{-i}) = \pi(x_i|\mathbf{x}_{n(i)}).$$

4 This dependence immediately forces the neighbourhood relation to be symmetric.

Markov random fields have been used for quite a while in imaging, not necessarily because images obey Markov laws but rather because these dependence structures offer highly stabilizing properties in modeling. Indeed, constructing the joint prior distribution of an image is a daunting task because there is no immediate way of describing the global properties of an image via a probability distribution. Just as for the directed acyclic graphs (DAG) models at the core of the BUGS software, using the full conditional distributions breaks the problem down to a sequence of local problems and this is therefore more manageable in the sense that we may be able to express more clearly how we think $x_i$ behaves when the configuration of its neighbors is known.

Before launching into the use of specific MRFs to describe prior assumptions on a given lattice, we need to worry5 6 about the very existence of MRFs! Indeed, defining a set of full conditionals does not guarantee that there is a joint distribution behind them (Exercise 8.1). In our case, this means that general forms of neighborhoods and general types of dependences on the neighbors do not usually correspond to a joint distribution on $\mathbf{x}$.

We first obtain a representation that can be used for testing the existence of a joint distribution. Starting from a complete set of full conditionals on a lattice $\mathcal{I}$, if there indeed exists a corresponding joint distribution, $\pi(\mathbf{x})$, it is completely defined by the ratio $\pi(\mathbf{x})/\pi(\mathbf{x}^*)$ for a given fixed value $\mathbf{x}^*$ since the normalizing constant is automatically determined. Now, if $\mathcal{I} = \{1,\ldots,n\}$, it is simple to exhibit a full conditional density within the joint density by writing the natural decomposition

$$\pi(\mathbf{x}) = \pi(x_1|\mathbf{x}_{-1})\,\pi(\mathbf{x}_{-1})$$

and then to introduce $\mathbf{x}^*$ by the simple divide-and-multiply trick

$$\pi(\mathbf{x}) = \frac{\pi(x_1|\mathbf{x}_{-1})}{\pi(x_1^*|\mathbf{x}_{-1})}\,\pi(x_1^*,\mathbf{x}_{-1}).$$

If we iterate this trick for all terms in the lattice (assuming we never divide by 0), we eventually get to the representation

$$\frac{\pi(\mathbf{x})}{\pi(\mathbf{x}^*)} = \prod_{i=0}^{n-1} \frac{\pi(x_{i+1}|x_1^*,\ldots,x_i^*,x_{i+2},\ldots,x_n)}{\pi(x_{i+1}^*|x_1^*,\ldots,x_i^*,x_{i+2},\ldots,x_n)}. \qquad (8.1)$$

Hence, we can truly write the joint density as a product of ratios of its full conditionals modulo one renormalization.7

5 It is no surprise that computational techniques such as the Gibbs sampler stemmed from this area, as the use of conditional distributions is deeply ingrained in the imaging community.

6 For those who do not want or do not need to worry, the end of this section can be skipped, it being of a more theoretical nature and not used in the rest of the chapter.

This result can also be used toward our purpose of checking for compatibility of the full conditional distributions: if there exists a joint density such that the full conditionals never cancel, then (8.1) must hold for every representation of $\mathcal{I} = \{1,\ldots,n\}$; that is, for every ordering of the indices, and for every choice of reference value $\mathbf{x}^*$. Although we cannot provide here the reasoning behind the result, there exists a necessary and sufficient condition for the existence of an MRF. This condition relies on the notion of clique: Given a lattice $\mathcal{I}$ and a neighbourhood relation $\sim$, a clique is a maximal subset of $\mathcal{I}$ made of sites that are all neighbors. The corresponding existence result (Cressie, 1993) is that an MRF associated with $\mathcal{I}$ and the neighbourhood relation $\sim$ necessarily is of the form

$$\pi(\mathbf{x}) \propto \exp\left(-\sum_{C\in\mathcal{C}} \Phi_C(\mathbf{x}_C)\right), \qquad (8.2)$$

where $\mathcal{C}$ is the collection of all cliques. This result amounts to saying that the joint distribution must separate in terms of its system of cliques.

We now embark on the description of two specific MRFs that are appropriate for image analysis, namely the Ising model used for binary images and its extension, the Potts model, used for images with more than two colors.


8.2.3  The  Ising  Model 


If pixels of the image $\mathbf{x}$ under study can only take two colors (black and white, say, as in Fig. 8.1), $\mathbf{x}$ is binary. We typically refer to each pixel $x_i$ as being foreground if $x_i = 1$ (black) and background if $x_i = 0$ (white). The conditional distribution of a pixel is then Bernoulli, with the corresponding probability parameter depending on the other pixels. A simplification step is to assume that it is a function of the number of black neighboring pixels, using for instance a logit link as ($j = 0,1$)

$$\pi(x_i = j|\mathbf{x}_{-i}) \propto \exp(\beta n_{i,j}), \qquad \beta > 0, \qquad (8.3)$$

where $n_{i,j} = \sum_{\ell\in n(i)} \mathbb{I}_{x_\ell = j}$ is the number of neighbors of $x_i$ with color $j$. The Ising model is then defined via these full conditionals

$$\pi(x_i = 1|\mathbf{x}_{-i}) = \frac{\exp(\beta n_{i,1})}{\exp(\beta n_{i,0}) + \exp(\beta n_{i,1})},$$

and the joint distribution therefore satisfies

$$\pi(\mathbf{x}) \propto \exp\left(\beta \sum_{j\sim i} \mathbb{I}_{x_j = x_i}\right), \qquad (8.4)$$

where the summation is taken over all pairs $(i,j)$ of neighbors (Exercise 8.17).

7 This representation is by no means limited to MRFs: it holds for every joint distribution such that the full conditionals never cancel. It is called the Hammersley-Clifford theorem, and a two-dimensional version of it was introduced in Exercise 3.10.

When inferring on $\beta$ and thus simulating from the posterior distribution of $\beta$, we will be faced with a major obstacle, namely that the normalizing constant of (8.4), $Z(\beta)$, is intractable except for very small lattices $\mathcal{I}$, while depending on $\beta$. Therefore the likelihood function cannot be computed. We will introduce in Sect. 8.3 a computational technique called ABC that is intended to fight this very problem. At this early stage, however, we consider $\beta$ to be known and focus on the simulation of $\mathbf{x}$ in preparation for the inference on both $\beta$ and $\mathbf{x}$ given a noisy version of the image, $\mathbf{y}$, as presented in Sect. 8.4.
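To see why $Z(\beta)$ is only available for very small lattices, the following sketch computes it by enumerating all $2^{nm}$ configurations and checks numerically that the full conditional derived from the joint distribution agrees with the Bernoulli form (8.3). It assumes a four-neighbor structure and a joint distribution in which each pair of neighbors is counted once; the function names ising_stat and ising_Z and the example image are illustrative.

```r
# Number of pairs of neighbors (each counted once) sharing the same color,
# for a binary matrix x under the four-neighbor structure.
ising_stat <- function(x) {
  n <- nrow(x); m <- ncol(x)
  horiz <- if (m > 1) sum(x[, -1] == x[, -m]) else 0
  vert <- if (n > 1) sum(x[-1, ] == x[-n, ]) else 0
  horiz + vert
}

# Normalizing constant by brute force: 2^(n*m) terms, tiny lattices only.
ising_Z <- function(beta, n, m) {
  configs <- as.matrix(expand.grid(rep(list(0:1), n * m)))
  sum(apply(configs, 1, function(v) exp(beta * ising_stat(matrix(v, n, m)))))
}

ising_Z(0.8, 3, 3)   # 512 terms; a 25 x 25 lattice would need 2^625 terms

# Full conditional of the central pixel of a 3 x 3 image, two ways
x <- matrix(c(1, 1, 1,
              1, 0, 0,
              0, 1, 1), 3, 3, byrow = TRUE)
beta <- 0.8
x1 <- x; x1[2, 2] <- 1
x0 <- x; x0[2, 2] <- 0
from_joint <- exp(beta * ising_stat(x1)) /
  (exp(beta * ising_stat(x0)) + exp(beta * ising_stat(x1)))
nb <- c(x[1, 2], x[3, 2], x[2, 1], x[2, 3])   # the four neighbors of (2, 2)
from_cond <- exp(beta * sum(nb == 1)) /
  (exp(beta * sum(nb == 0)) + exp(beta * sum(nb == 1)))
c(from_joint, from_cond)
```

The normalizing constant cancels in the full conditional, which is why simulating pixel by pixel remains feasible even though $Z(\beta)$ itself is out of reach.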

The computational conundrum of Ising models goes deeper as, due to the convoluted correlation structure of the Ising model, a direct simulation of $\mathbf{x}$ is not possible, except in very specific cases. Faced with this difficulty, the image community very early developed computational tools which eventually led in 1984 to the proposal of the Gibbs sampler (Sect. 3.5.1).8 The specification of Markov random fields and in particular of the Ising model implies that the full conditional distributions of those models are available in closed form. The local structure of Markov random fields thus provides an immediate site-by-site update for the Gibbs sampler:


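Algorithm 8.16 Ising Gibbs Sampler

Initialization: For $i \in \mathcal{I}$, generate independently

$$x_i^{(0)} \sim \mathcal{B}(1/2).$$

Iteration $t$ ($t \ge 1$):

1. Generate $\mathbf{u} = (u_i)_{i\in\mathcal{I}}$, a random ordering of the elements of $\mathcal{I}$.

2. For $1 \le \ell \le |\mathcal{I}|$, update $n^{(t)}_{u_\ell,0}$ and $n^{(t)}_{u_\ell,1}$, and generate

$$x^{(t)}_{u_\ell} \sim \mathcal{B}\left(\frac{\exp(\beta n^{(t)}_{u_\ell,1})}{\exp(\beta n^{(t)}_{u_\ell,0}) + \exp(\beta n^{(t)}_{u_\ell,1})}\right).$$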


In this implementation, the order of the updates of the pixels of $\mathcal{I}$ is random in order to overcome possible bottlenecks in the exploration of the distribution, although this is not a necessary condition for the algorithm to converge. In fact, when considering two pixels $x_1$ and $x_2$ that are $m$ pixels apart, the influence of a change in $x_1$ is not felt in $x_2$ before at least $m$ iterations of the basic Gibbs sampler. Of course, if $m$ is large, the dependence between $x_1$ and $x_2$ is quite moderate, but this slow propagation of changes is indicative of slow mixing in the Markov chain. For instance, to see a change of color of a relatively large homogeneous region is an event of very low probability, even though the distribution of the colors is exchangeable (Exercise 8.18).

8 The very name “Gibbs sampling” was proposed in reference to Gibbs random fields, related to the physicist Willard Gibbs. Interestingly, both of the major MCMC algorithms are thus named after physicists and were originally developed for problems that were beyond the boundaries of (standard) statistical inference.

If $\beta$ is large, the Ising distribution (8.4) is very peaked around both single color configurations. In such settings, the Gibbs sampler will face enormous difficulties in simply changing the value of a single pixel.

Running Algorithm 8.16 in R is straightforward: opting for a four-neighbor relation, if we use the following function for the number of neighbors at (a, b),

xneig4=function(x,a,b,col){
  # number of four-neighbors of pixel (a,b) with color col
  n=dim(x)[1];m=dim(x)[2]
  # x[0,b] and x[a,0] are empty, so borders are handled automatically
  nei=c(x[a-1,b]==col,x[a,b-1]==col)
  if (a!=n)
    nei=c(nei,x[a+1,b]==col)
  if (b!=m)
    nei=c(nei,x[a,b+1]==col)
  sum(nei)
}

the above Gibbs sampler can be written as

isingibbs=function(niter,n,m=n,beta){
  # initialization
  x=sample(c(0,1),n*m,prob=c(0.5,0.5),replace=TRUE)
  x=matrix(x,n,m)
  for (i in 1:niter){
    # random order of the rows and of the columns
    sampl1=sample(1:n)
    sampl2=sample(1:m)
    for (k in 1:n){
      for (l in 1:m){
        n0=xneig4(x,sampl1[k],sampl2[l],0)
        n1=xneig4(x,sampl1[k],sampl2[l],1)
        # draw from the full conditional (8.3)
        x[sampl1[k],sampl2[l]]=sample(c(0,1),1,
          prob=exp(beta*c(n0,n1)))
  }}}
  x
}

where niter is the number of times the whole matrix x is modified. (It should therefore be scaled against n*m, the size of x.) Figure 8.3 presents the output of simulations from Algorithm 8.16

> image(1:100,1:100,isingibbs(10^3,100,100,beta))

for different values of $\beta$. Although we cannot discuss here convergence assessment for the Gibbs sampler (see Robert and Casella, 2009, Chap. 8), the images thus produced are representative of the Ising distributions: the larger $\beta$, the more homogeneous the image (and also the slower the Gibbs sampler).9 When looking at the result associated with the larger values of $\beta$, we can start to see the motivations for using such representations to model images like the Menteith dataset, discussed in Sect. 8.4.

Fig. 8.3. Simulations from the Ising model with a four-neighbor neighbourhood structure on a 100 x 100 array after 1,000 iterations of the Gibbs sampler: $\beta$ varies in steps of 0.1 from 0.3 to 1.2 (first column, then second column)

Along with the slow dynamic induced by the single-site updating, we can
point out another inefficiency of this algorithm, namely that many updates
will not modify the current value of x simply because the new value of $x_l$
is equal to its previous value! It is, however, straightforward to modify the
algorithm so that it only proposes changes of values. The update of each pixel
$l$ is then a Metropolis-Hastings step with acceptance probability

$$\rho = \exp(\beta n_{l,1-x_l}) / \exp(\beta n_{l,x_l}) \wedge 1\,,$$

with the corresponding R function

  isinghm=function(niter,n,m=n,beta){
    x=sample(c(0,1),n*m,prob=c(0.5,0.5),rep=TRUE)
    x=matrix(x,n,m)
    for (i in 1:niter){
      sampl1=sample(1:n)
      sampl2=sample(1:m)
      for (k in 1:n){
      for (l in 1:m){
        # n0: neighbors sharing the current color, n1: neighbors of the other color
        n0=xneig4(x,sampl1[k],sampl2[l],x[sampl1[k],sampl2[l]])
        n1=xneig4(x,sampl1[k],sampl2[l],1-x[sampl1[k],sampl2[l]])
        if (runif(1)<exp(beta*(n1-n0)))
          x[sampl1[k],sampl2[l]]=1-x[sampl1[k],sampl2[l]]
      }}}
    x
  }

Although  the  details  are  too  involved  to  be  included  here,  Liu  (1996)  has 
shown  that  this  alternative  is  faster  (to  converge)  than  the  original  Gibbs 
sampler. 
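A quick numerical check, which is not a substitute for the result of Liu (1996), helps to see why forcing moves can only increase the chance of changing a pixel. The sketch below compares, for an interior pixel, the probability that a Gibbs update changes its color with the Metropolis-Hastings acceptance probability of the forced flip. The function names gibbs_change and mh_change are hypothetical helpers of ours.

```r
# n_same: number of neighbours sharing the pixel's current colour
# n_diff: number of neighbours with the other colour
gibbs_change <- function(n_same, n_diff, beta)
  exp(beta * n_diff) / (exp(beta * n_same) + exp(beta * n_diff))
mh_change <- function(n_same, n_diff, beta)
  pmin(1, exp(beta * (n_diff - n_same)))

# all splits of the four neighbours of an interior pixel
n_same <- 0:4
n_diff <- 4 - n_same
comp <- data.frame(n_same, n_diff,
                   gibbs = gibbs_change(n_same, n_diff, 0.8),
                   mh    = mh_change(n_same, n_diff, 0.8))
print(comp)
```

For every neighborhood configuration the forced flip is accepted at least as often as the Gibbs update changes the pixel, since $\min(1, r) \ge r/(1+r)$ for any $r > 0$.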

8.2.4  The  Potts  Model 

The generalization of the Ising model to cases when the image has more than
two colors, $G$ say, is straightforward. If $n_{i,g}$ denotes the number of neighbors
of $i \in \mathcal{I}$ with color $g$ ($1 \le g \le G$), that is,

$$n_{i,g} = \sum_{j \sim i} \mathbb{I}_{x_j = g}\,,$$

the full conditional distribution of $x_i$ is chosen as

$$\pi(x_i = g \mid \mathbf{x}_{-i}) \propto \exp(\beta n_{i,g})\,.$$

This choice corresponds to a (true) joint probability model, the Potts model,
whose density is given by (Exercise 8.6)

$$\pi(\mathbf{x}) \propto \exp\Big(\beta \sum_{j \sim i} \mathbb{I}_{x_j = x_i}\Big)\,. \qquad (8.5)$$

9 In fact, there exists a critical value of $\beta$, $\beta_c = 2.269185$ in the case of the four-
neighbor relation, such that, when $\beta > \beta_c$, the Markov chain converges to one of two
different stationary distributions, depending on the starting point. In other words,
the chain is no longer irreducible. In particle physics, this phenomenon is called
phase transition.

v  u  ) 

This model is a clear generalization of the Ising model and it suffers from the
same drawback, namely that the normalizing constant of this density (which
is a function of $\beta$) is not available in closed form and thus hinders inference
and the computation of the likelihood function.

Once again, we face the hindrance that, when simulating x from a Potts
model with a large $\beta$, the single-site Gibbs sampler may be quite slow. More
efficient alternatives are available, including the Swendsen-Wang algorithm
(Exercise 8.7). For instance, Algorithm 8.17 below is again a Metropolis-
Hastings algorithm that forces moves on the current values. Note the special
feature that, while this Metropolis-Hastings proposal is not a random walk,
using instead a uniform proposal on the $G - 1$ other possible values still leads
to an acceptance probability that is equal to the ratio of the target densities.


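Algorithm 8.17 Potts Metropolis-Hastings Sampler

Initialization: For $i \in \mathcal{I}$, generate independently

$$x_i^{(0)} \sim \mathcal{U}(\{1, \ldots, G\})\,.$$

Iteration t ($t \ge 1$):

1. Generate $u = (u_i)_{i \in \mathcal{I}}$, a random ordering of the elements of $\mathcal{I}$.

2. For $1 \le \ell \le |\mathcal{I}|$,

generate

$$\tilde{x}_{u_\ell}^{(t)} \sim \mathcal{U}(\{1, 2, \ldots, x_{u_\ell}^{(t-1)} - 1, x_{u_\ell}^{(t-1)} + 1, \ldots, G\})\,,$$

compute the $n_{u_\ell,g}^{(t)}$ and

$$\rho_\ell = \big\{\exp(\beta n^{(t)}_{u_\ell,\tilde{x}}) / \exp(\beta n^{(t)}_{u_\ell,x_{u_\ell}})\big\} \wedge 1\,,$$

and set $x_{u_\ell}^{(t)}$ equal to $\tilde{x}_{u_\ell}$ with probability $\rho_\ell$.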
Figure 8.4 illustrates the result of a simulation using Algorithm 8.17 in a
situation where there are G = 4 colors, using the following R function

  pottshm=function(ncol=2,niter=10^4,n,m=n,beta=0){
    x=matrix(sample(1:ncol,n*m,rep=TRUE),n,m)
    for (i in 1:niter){
      sampl=sample(1:(n*m))
      for (k in 1:(n*m)){
        xcur=x[sampl[k]]
        a=(sampl[k]-1)%%n+1    # row index
        b=(sampl[k]-1)%/%n+1   # column index
        # draw among the other colors; indexing avoids sample() treating a
        # single remaining value v as the range 1:v when ncol=2
        others=(1:ncol)[-xcur]
        xtilde=others[sample.int(length(others),1)]
        acpt=beta*(xneig4(x,a,b,xtilde)-xneig4(x,a,b,xcur))
        if (log(runif(1))<acpt) x[sampl[k]]=xtilde
      }}
    return(x)
  }

for the simulation. (The use of a single vector of indices for rows and columns
is a programming trick that removes a loop in the code and thus saves a
considerable amount of computing time. This also allows a true uniform
distribution in sampl. Note the call to the congruential operators %% for modulo
and %/% for integer division.) We point out the reinforced influence of large
$\beta$'s on Fig. 8.4: not only is the homogeneity higher, but there is also a larger
differentiation in the colors.10 We stress that, while $\beta$ in Fig. 8.4 ranges over
the same values as in Fig. 8.3, the $\beta$'s are not directly comparable since the
larger number of classes in the Potts model induces a smaller value of the
$n_{i,g}$'s for the neighbourhood structure.
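The single-index trick relies on the column-major storage of R matrices. As a minimal check, assuming an arbitrary 3 x 4 matrix of our own choosing, the following lines confirm that the pair (a, b) recovered with %% and %/% addresses the same entry as the single index k.

```r
n <- 3; m <- 4
k <- 1:(n * m)                 # single index running over the whole matrix
a <- (k - 1) %% n + 1          # row index recovered by modulo
b <- (k - 1) %/% n + 1         # column index recovered by integer division
x <- matrix(seq_len(n * m), n, m)

# x[k] and x[a, b] address the same entries (column-major storage)
same_entries <- all(x[k] == x[cbind(a, b)])
print(same_entries)
```

The base function arrayInd performs the same conversion and can be used as a cross-check.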


8.3  Handling  the  Normalizing  Constant 


While simulating random variables distributed from a Potts model is required
in several settings, one of which we will cover in the next section, a more
common statistical setting is observing x distributed as

$$f(\mathbf{x} \mid \beta) = \frac{1}{Z(\beta)} \exp\Big(\beta \sum_{j \sim i} \mathbb{I}_{x_j = x_i}\Big)\,, \qquad (8.6)$$

where $Z(\beta)$ is the normalizing constant of the density in x, and inferring upon
the parameter $\beta$, using for instance a uniform prior $\beta \sim \mathcal{U}(0, 2)$.11

10 Similar to the Ising model mentioned in Footnote 9, there also exist a phase
transition phenomenon and a critical value for $\beta$ in this model.

11 The upper bound on $\beta$ in the above prior is chosen for a very precise reason: As
mentioned in the previous footnotes, when $\beta > 2$, the Potts model associated with
a four-neighbor relation is almost surely concentrated on single-color images. It is
thus pointless to consider larger values of $\beta$.

Fig. 8.4. Simulations from the Potts model with four grey levels and a four-neighbor
neighbourhood structure based on 1,000 iterations of the Metropolis-Hastings
sampler. The parameter $\beta$ varies in steps of 0.1 from 0.3 to 1.2 (first column, then second
column)


264  8  Image  Analysis 


The primary computational difficulty with this inference is the
unavailability of the normalizing constant

$$Z(\beta) = \sum_{\mathbf{x}} \exp\{\beta S(\mathbf{x})\}\,,$$

where $S(\mathbf{x}) = \sum_{j \sim i} \mathbb{I}_{x_j = x_i}$. The above summation operates over the
$G^{|\mathcal{I}|}$ possible values of x, where $|\mathcal{I}|$ denotes the size of $\mathcal{I}$. It involves too
many terms to be manageable. In the case of the Ising model, the number
of terms in the sum is for instance 2 to the power the number of points in
the lattice. For a small 256 x 256 black-and-white image, there are therefore
$2^{65536}$ terms in the sum! Furthermore, this is not a setting where a standard
MCMC solution would apply because of the same difficulty: a Metropolis-
Hastings algorithm also requires the evaluation of the ratio $Z(\beta)/Z(\tilde\beta)$ in the
acceptance probability. Unsurprisingly, addressing the approximation of $Z(\beta)$
has given rise to a huge literature, as shown by Ripley (1988) and Rue and
Held (2005), but the solutions are mostly too convoluted for this book (see,
e.g., the auxiliary variable method of Møller et al., 2006). We first describe a
semi-practical resolution of this difficulty, called path sampling, which is costly
in computing time for large images, before moving to a more generic if less
precise solution.
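To make the size of the summation concrete, here is a brute-force sketch that enumerates every binary image of a tiny lattice and sums $\exp\{\beta S(\mathbf{x})\}$. The names suff and Zexact are ours, and we assume the convention that $S$ counts each pair of neighboring pixels once under the four-neighbor relation.

```r
# S(x): number of pairs of neighbouring pixels (four-neighbour relation)
# sharing the same colour, each pair being counted once (our convention)
suff <- function(x) {
  n <- nrow(x); m <- ncol(x)
  sum(x[-1, , drop = FALSE] == x[-n, , drop = FALSE]) +
    sum(x[, -1, drop = FALSE] == x[, -m, drop = FALSE])
}

# exact Z(beta) by enumerating all 2^(n*m) binary images (n*m at most 31)
Zexact <- function(beta, n, m = n) {
  N <- n * m
  total <- 0
  for (code in 0:(2^N - 1)) {
    # the first N bits of the integer code give the N pixel colours
    x <- matrix(as.integer(intToBits(as.integer(code)))[1:N], n, m)
    total <- total + exp(beta * suff(x))
  }
  total
}

z3 <- Zexact(0.5, 3)   # 512 terms only
print(z3)
```

A 3 x 3 lattice requires 512 terms, while a 5 x 5 lattice already requires 2^25, that is, more than 33 million terms, so this approach is only useful for checking other approximations on toy examples.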


8.3.1  Path  Sampling 

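The path sampling technique is based on a derivative representation of the
normalizing constant. Since

$$\frac{\mathrm{d}Z(\beta)}{\mathrm{d}\beta} = \sum_{\mathbf{x}} S(\mathbf{x}) \exp(\beta S(\mathbf{x}))\,,$$

we can express this derivative as an expectation under $\pi(\mathbf{x}|\beta)$,

$$\frac{\mathrm{d}Z(\beta)}{\mathrm{d}\beta} = Z(\beta) \sum_{\mathbf{x}} S(\mathbf{x}) \frac{\exp(\beta S(\mathbf{x}))}{Z(\beta)} = Z(\beta)\, \mathbb{E}_\beta[S(\mathbf{X})]\,,$$

that is,

$$\frac{\mathrm{d}\log Z(\beta)}{\mathrm{d}\beta} = \mathbb{E}_\beta[S(\mathbf{X})]\,.$$

Therefore, the ratio $Z(\beta_1)/Z(\beta_0)$ can be represented as an integral,

$$\log\{Z(\beta_1)/Z(\beta_0)\} = \int_{\beta_0}^{\beta_1} \mathbb{E}_\beta[S(\mathbf{X})]\, \mathrm{d}\beta\,, \qquad (8.7)$$

leading to the path sampling identity (see Chen et al., 2000, for many more
details about this technique).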
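The path sampling identity can be checked numerically on a lattice small enough for exhaustive enumeration. The sketch below is only such a sanity check, not the Monte Carlo procedure used for real images: it assumes a 3 x 3 Ising lattice, computes $\log Z(\beta)$ and $\mathbb{E}_\beta[S(\mathbf{X})]$ exactly, and compares both sides of the identity for the arbitrary pair $(\beta_0, \beta_1) = (0.2, 1.1)$. The helper names allS, logZ, and meanS are ours.

```r
# all values S(x) over the 2^(n*n) images of an n x n Ising lattice
allS <- function(n) {
  N <- n * n
  sapply(0:(2^N - 1), function(code) {
    x <- matrix(as.integer(intToBits(as.integer(code)))[1:N], n, n)
    sum(x[-1, ] == x[-n, ]) + sum(x[, -1] == x[, -n])
  })
}
S <- allS(3)

logZ  <- function(beta) log(sum(exp(beta * S)))                    # exact log Z(beta)
meanS <- function(beta) sum(S * exp(beta * S)) / sum(exp(beta * S)) # exact E_beta[S(X)]

# right-hand side: integral of E_beta[S(X)] between beta0 and beta1
rhs <- integrate(Vectorize(meanS), 0.2, 1.1)$value
# left-hand side: log of the ratio of normalizing constants
lhs <- logZ(1.1) - logZ(0.2)
print(c(lhs = lhs, rhs = rhs))
```

Both numbers agree up to the tolerance of the numerical integration, which is what the derivative representation predicts.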
Although (8.7) may not look like a considerable improvement, since we now
have to compute an expectation in x plus an integral over $\beta$, the
representation (8.7) is appealing because we can use standard simulation procedures
for its approximation. First, for a given value of $\beta$, $\mathbb{E}_\beta[S(\mathbf{X})]$ can be
approximated from an MCMC sequence simulated by Algorithm 8.17. Obviously,
changing the value of $\beta$ should involve a new simulation run; however, the cost
can be attenuated by using instead importance sampling for similar values
of $\beta$. Second, the integral itself can be approximated by numerical
quadrature, namely by computing the value of $f(\beta) = \mathbb{E}_\beta[S(\mathbf{X})]$ for a finite number
of values of $\beta$ and approximating $f(\beta)$ by a piecewise-linear function $\hat{f}(\beta)$ for
the intermediate values of $\beta$. Indeed, for two consecutive grid values $\beta_0$ and $\beta_1$,

$$\int_{\beta_0}^{\beta_1} f(\beta)\, \mathrm{d}\beta \approx \hat{f}(\beta_0)(\beta_1 - \beta_0) + \{\hat{f}(\beta_1) - \hat{f}(\beta_0)\} \frac{(\beta_1 - \beta_0)}{2}\,,$$

where $\hat{f}(\beta)$ is approximated by the above Monte Carlo method.
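The piecewise-linear quadrature amounts to the trapezoidal rule on the grid of $\beta$ values. As an illustration under an artificial assumption (the values of $\hat{f}$ are taken to be $\beta^2$ rather than Monte Carlo estimates), the sketch below compares a direct trapezoidal sum with the integral of the approxfun interpolant. The function trapz is a hypothetical helper of ours.

```r
# trapezoidal rule applied between consecutive grid points
trapz <- function(b, f) sum(diff(b) * (head(f, -1) + tail(f, -1)) / 2)

b <- seq(0, 2, by = 0.1)
f <- b^2                       # stand-in for Monte Carlo estimates of f(beta)
fhat <- approxfun(b, f)        # piecewise-linear interpolation of the grid values

int_trapz <- trapz(b, f)
int_quad  <- integrate(fhat, 0, 2, subdivisions = 1000)$value
print(c(trapz = int_trapz, integrate = int_quad))
```

Both values are close to 2.67, slightly above the exact integral 8/3 of the smooth function, since a piecewise-linear interpolant overestimates a convex function.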

The rendering of the above in R for Laichedata is as follows for a four-
neighbor relation: the expectation $\mathbb{E}_\beta[S(\mathbf{X})]$ is approximated via the following
R function

  sumising=function(niter=10^3,numb,beta){
    S=0
    x=matrix(sample(c(0,1),numb^2,rep=TRUE),ncol=numb)
    for (i in 1:niter){
      s=0
      sampl1=sample(1:numb)
      sampl2=sample(1:numb)
      for (k in 1:numb){
      for (l in 1:numb){
        n0=xneig4(x,sampl1[k],sampl2[l],x[sampl1[k],sampl2[l]])
        n1=xneig4(x,sampl1[k],sampl2[l],1-x[sampl1[k],sampl2[l]])
        if (log(runif(1))<(beta*(n1-n0))){
          x[sampl1[k],sampl2[l]]=1-x[sampl1[k],sampl2[l]]
          n0=n1}
        s=s+n0
      }}
      # only the second half of the iterations enters the average
      if (2*i>niter)
        S=S+s
    }
    return(2*S/niter)
  }

for a few selected values of $\beta$, while the whole function $f(\beta)$ is then
approximated using the R procedure approxfun as

  Z=seq(0,2,by=.1)
  for (i in 1:21)
    Z[i]=sumising(numb=24,beta=Z[i])
  lrcst=approxfun(seq(0,2,.1),Z)

This approximation is illustrated by Fig. 8.5. The logarithm of the ratio of the
constants, $\log\{Z(\beta)/Z(\tilde\beta)\}$, is provided by the R numerical integration
function, integrate, as

  Zratio=integrate(lrcst,betatilde,beta)$value

and can be easily inserted within a random walk Metropolis-Hastings
algorithm. Indeed, now that we have painstakingly constructed a satisfactory
approximation of $Z(\beta_1)/Z(\beta_0)$ for any arbitrary pair $(\beta_0, \beta_1)$, we can run an
MCMC sampler targeting the posterior distribution $\pi(\beta|\mathbf{x})$, where simulation
at iteration t is based on the proposal

$$\tilde\beta \sim \mathcal{U}([\beta^{(t-1)} - h, \beta^{(t-1)} + h])\,;$$

that is, a uniform move with range 2h. The acceptance ratio associated with
the pair $(\beta^{(t-1)}, \tilde\beta)$ is thus given by

$$1 \wedge \frac{Z(\beta^{(t-1)})}{Z(\tilde\beta)} \exp\big\{(\tilde\beta - \beta^{(t-1)}) S(\mathbf{x})\big\}\,,$$

which translates into the R code

  # lvr is assumed to hold the observed value of S(x)
  betatilde=beta[t-1]+runif(1,-0.05,0.05)
  laccept=lvr*(betatilde-beta[t-1])+integrate(lrcst,
    betatilde,beta[t-1])$value
  if (runif(1)<exp(laccept)){
    beta[t]=betatilde}else{
    beta[t]=beta[t-1]}

Fig. 8.5. Monte Carlo approximation of $\mathbb{E}_\beta[S(\mathbf{X})]$ for a 24 x 24 Ising model, based
on $10^3$ iterations. The irregularity at the penultimate value of $\beta$ can be attributed
to a failed convergence of the Gibbs sampler

The outcome of this MCMC algorithm is represented by the histogram of
Fig. 8.6, which exhibits a very regular posterior distribution for $\beta$, symmetric
around 0.47. Thanks to the path sampling approximation to $Z(\beta)$,
running $10^5$ iterations is almost instantaneous.

Fig. 8.6. Dataset Laichedata: Histogram of the MCMC sample of $\beta$'s produced
using the path sampling approximation to the ratio $Z(\beta)/Z(\tilde\beta)$, when based on $10^5$
iterations (horizontal axis: $\beta$, from 0.35 to 0.60)
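For readers who wish to experiment without the Laichedata image, the random walk sampler on $\beta$ can be tried on a toy problem. The sketch below makes several assumptions of ours: a 3 x 3 Ising lattice, for which $\log Z(\beta)$ is computed exactly by enumeration instead of by path sampling, a hypothetical observed value s_obs of the statistic $S(\mathbf{x})$, and a scale h = 0.3. Proposals falling outside the support (0, 2) of the uniform prior are rejected.

```r
set.seed(5)
n <- 3; N <- n * n
# exact enumeration of S(x) over all 2^9 images (toy replacement for path sampling)
S <- sapply(0:(2^N - 1), function(code) {
  x <- matrix(as.integer(intToBits(as.integer(code)))[1:N], n, n)
  sum(x[-1, ] == x[-n, ]) + sum(x[, -1] == x[, -n])
})
logZ <- function(beta) log(sum(exp(beta * S)))

s_obs <- 9                     # hypothetical observed value of S(x), at most 12 here
niter <- 5000; h <- 0.3
beta <- numeric(niter); beta[1] <- 1
for (t in 2:niter) {
  prop <- beta[t - 1] + runif(1, -h, h)
  if (prop < 0 || prop > 2) {  # outside the U(0,2) prior: reject
    beta[t] <- beta[t - 1]
    next
  }
  # log acceptance ratio: likelihood ratio times ratio of normalizing constants
  laccept <- s_obs * (prop - beta[t - 1]) + logZ(beta[t - 1]) - logZ(prop)
  beta[t] <- if (log(runif(1)) < laccept) prop else beta[t - 1]
}
print(summary(beta))
```

With so small a lattice the posterior is far more diffuse than in Fig. 8.6, but the structure of the sampler is the same once logZ is replaced by the path sampling approximation.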


8.3.2  The  ABC  Method 

In a general setting where the likelihood function is not available in a closed
form, the trick at the core of the path sampling technique is not always
available. (Consider for instance the case of a multivariate $\beta$.) We thus need to turn
towards faster if more rudimentary approximations and a method of choice
is the ABC (approximate Bayesian computation) technique, introduced by
Pritchard et al. (1999) in population genetic settings.

The method starts from a valid rejection technique bypassing the
computation of the likelihood function. Namely, if we observe $x \sim f(x|\theta)$
and if $\pi(\theta)$ is the prior distribution on the parameter $\theta$, then an algorithm
that jointly simulates

$$\theta' \sim \pi(\theta) \quad \text{and} \quad y \sim f(y|\theta')$$

and accepts the simulated $\theta'$ if, and only if, the auxiliary variable y is equal
to the observed value,

$$x = y\,,$$

is exact in the sense that the accepted $\theta'$'s are distributed from the posterior.
Obviously, the algorithm is not practical in cases when x is continuous or even
takes a large enough number of values.12 In most standard occurrences, the
ABC algorithm starts with an approximation, in the sense that the equality
constraint x = y is replaced with a tolerance condition, $\rho(x, y) < \epsilon$, where $\rho$
is a measure of discrepancy between x and y. We will call $\epsilon > 0$ the tolerance
bound and $\rho$ will be chosen as a distance between summary statistics. The
output of the ABC algorithm is then distributed from the distribution with
density proportional to

$$\pi(\theta)\, P_\theta(\rho(x, y) < \epsilon)\,,$$

where the probability is associated with $y \sim f(y|\theta)$. This density is denoted
by $\pi_\epsilon(\theta \mid x)$.
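When the data take few values, the exact rejection version can be run as such. The following sketch uses a toy binomial model of our own choosing (10 trials, 3 observed successes, uniform prior on the success probability), for which the posterior is a Beta(4, 8) distribution with mean 1/3, so that the output of the rejection step can be compared with a known answer.

```r
set.seed(1)
n <- 10; x <- 3                    # toy data: 3 successes out of 10 trials
N <- 1e5
theta <- runif(N)                  # theta' simulated from the uniform prior
y <- rbinom(N, n, theta)           # y simulated from f(y | theta')
accepted <- theta[y == x]          # keep theta' only when y equals x exactly

print(c(abc_mean = mean(accepted),
        exact_mean = (x + 1) / (n + 2),
        acceptance_rate = length(accepted) / N))
```

The acceptance rate is close to 1/(n + 1), the marginal probability of each outcome under the uniform prior, which shows how quickly exact matching becomes impractical as the number of possible values grows.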

If the tolerance $\epsilon$ is "too large", the approximation is poor; to understand
why, consider that, when $\epsilon$ goes to $\infty$, the ABC algorithm amounts to
simulating from the prior since all simulations are accepted. If $\epsilon$ is sufficiently
small, $\pi_\epsilon(\theta|x)$ is a good approximation of $\pi(\theta|x)$, but the acceptance
probability may be too low for this value to be practical. Selecting the "right" $\epsilon$ is
thus crucial. It is customary to pick $\epsilon$ as an empirical quantile of $\rho(x, y)$ when
y is simulated from the marginal

$$\int \pi(\theta) f(y|\theta)\, \mathrm{d}\theta$$

and the choice is often the corresponding 1% quantile. This quantile is easily
approximated by simulation.

In settings when the data x has a large dimension, the ABC algorithm
uses instead a distance between summary statistics $\rho(S(x), S(y))$ rather than
a distance between x and y. This choice throws away some information
contained in the data about $\theta$, but it also allows us to concentrate on important
features of the data in order to bring a maximal discrimination between the
observed and the simulated statistics. It is thus rarely the case that $S$ is a
sufficient statistic.13 In the general case, the output of the ABC algorithm is
therefore a simulation from the distribution $\pi_\epsilon(\theta \mid x)$. The ABC algorithm
thus reads as follows:

12 Note that, for Laichedata, it is possible to wait for the equality S(x) = S(y)
with a sufficiently high probability. In that case, since S is a sufficient statistic, we
are simulating from the exact posterior.


Algorithm 8.18 ABC algorithm

For $i = 1, \ldots, N$,

1. Generate $\theta_i$ from the prior $\pi$.

2. Generate $y_i$ from the model distribution $f(y|\theta_i)$.

3. Compute the distance $\rho(S(y_i), S(x))$.

Deduce $\epsilon$ as the 1% quantile of the distances. Accept the $\theta_i$'s such that
$\rho(S(x), S(y_i)) \le \epsilon$.
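The steps of the ABC algorithm can be sketched on a toy conjugate problem where the exact posterior is available for comparison. All modeling choices below are assumptions of ours: 20 observations from a normal distribution with unknown mean and unit variance, a normal prior with variance 10 on the mean, the sample mean as summary statistic, and the absolute difference as distance. Since the sample mean of the pseudo-data has a known normal distribution, it is simulated directly.

```r
set.seed(42)
x <- rnorm(20, mean = 1)                 # toy "observed" sample
N <- 1e5
theta <- rnorm(N, 0, sqrt(10))           # 1. theta_i from the N(0, 10) prior
ybar  <- rnorm(N, theta, sqrt(1 / 20))   # 2. S(y_i): mean of 20 N(theta_i, 1) draws
dist  <- abs(ybar - mean(x))             # 3. distance between the summaries
eps   <- quantile(dist, 0.01)            # tolerance: 1% quantile of the distances
abc   <- theta[dist <= eps]              # accepted values

# exact conjugate posterior mean, for comparison
post_mean <- 20 * mean(x) / (20 + 1 / 10)
print(c(eps = unname(eps), abc_mean = mean(abc), exact_mean = post_mean))
```

In this example the sample mean is sufficient, so the only approximation comes from the positive tolerance, and the ABC mean is close to the exact posterior mean.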


To illustrate the ABC method in a simple environment, consider the
problem already processed in Chap. 2 about assessing whether a normal $\mathcal{N}(\mu, \sigma^2)$
distribution has a zero mean, $\mu = 0$. As explained in Sect. 2.3.1, the natural
Bayesian approach is to include the model index $\mathfrak{M}$ as an extra parameter
taking only the values 1 (when $\mu = 0$) and 2 (when $\mu \ne 0$). In other words,
Bayesian inference covers the pair $(\mathfrak{M}, \theta)$, conditional on the data $\mathcal{D}_n$.
Simulating by ABC from the posterior on $(\mathfrak{M}, \theta)$ given $\mathcal{D}_n$ then follows from
Algorithm 8.18:

1. Generate $\mathfrak{M}^i$ uniformly at random on {1, 2} ($i = 1, \ldots, N$).

2. Generate $\theta_i$ from the prior $\pi(\theta|\mathfrak{M}^i)$ ($i = 1, \ldots, N$).

3. Generate $\mathcal{D}_n^i$ from the normal model indexed by $(\mathfrak{M}^i, \theta_i)$ ($i = 1, \ldots, N$).

4. Compute the distances between the statistics $(\bar{x}(\mathcal{D}_n), s^2(\mathcal{D}_n))$ and
$(\bar{x}(\mathcal{D}_n^i), s^2(\mathcal{D}_n^i))$ ($i = 1, \ldots, N$).

5. Deduce $\epsilon$ as the 1% quantile of the distances.

6. Accept the $\mathfrak{M}^i$'s for which the distances are less than $\epsilon$.

The distance we pick is inspired from the likelihood function, namely

$$\rho\{(\bar{x}(\mathcal{D}_n), s^2(\mathcal{D}_n)), (\bar{x}(\mathcal{D}^*), s^2(\mathcal{D}^*))\} = n\{\bar{x}(\mathcal{D}_n) - \bar{x}(\mathcal{D}^*)\}^2 + \{s^2(\mathcal{D}_n)/s^2(\mathcal{D}^*)\} - 1 - \log\{s^2(\mathcal{D}_n)/s^2(\mathcal{D}^*)\}\,.$$

The implementation is then straightforward: we select one of the models at
random, simulate from the corresponding (necessarily proper) prior on the
parameter(s) and create a normal sample $\mathcal{D}^*$. The posterior probability of
the model associated with $\mu = 0$ is then estimated by the proportion of
accepted simulations from the simpler model. Under an $\mathcal{E}xp(1)$ prior on $\sigma^{-2}$ in both
models and a $\mathcal{N}(0, \sigma^2)$ prior on $\mu$ under the larger model, with the normaldata
benchmark, the R code goes as follows:

13 The setting of Markov random fields like the Ising and the Potts models is
an exception in that it allows for a sufficient statistic, while being intractable via
classical approaches.
  > xbar=mean(normaldata)
  > s2=(n-1)*var(normaldata)
  > Nsim=10^6   #simulations from the prior
  > indem=sample(c(0,1),Nsim,rep=TRUE)
  > ssigma=1/rexp(Nsim)
  > smu=rnorm(Nsim)*sqrt(ssigma)*(indem==1)
  > ss2=s2/(ssigma*rchisq(Nsim,n-1))
  > sobs=n*(rnorm(Nsim,smu,sqrt(ssigma/n))-xbar)^2+
  +   ss2-1-log(ss2)
  > epsi=quantile(sobs,.001)   #bound and selection
  > prob=sum(indem[sobs<=epsi]==0)/(0.001*Nsim)
  > (1-prob)/prob
  [1] 0.1574074


producing a numerical value to be compared with the exact Bayes factor

$$(n+1)^{-1/2} \left[ \frac{n\bar{x}^2 + s^2 + 2}{n\bar{x}^2/(n+1) + s^2 + 2} \right]^{(n+2)/2}$$

(deduced from the derivation on page 45 by modifying for the exponential
prior), which is equal to 0.1369 for normaldata. Figure 8.7 represents the
variability of the ABC approximation compared with the true value.
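The closed-form Bayes factor is easy to code. The sketch below rests on our reading of the setting, namely a $\mathcal{N}(0, \sigma^2)$ prior on the mean under the larger model and an exponential prior with rate 1 on $\sigma^{-2}$ in both models, which gives the expression $(n+1)^{-1/2}[(n\bar{x}^2 + s^2 + 2)/(n\bar{x}^2/(n+1) + s^2 + 2)]^{(n+2)/2}$ for the larger model against the model with zero mean. The function name bf_mean is ours, and the data are simulated since normaldata is not reproduced here.

```r
# Bayes factor of the model with free mu against the model mu = 0, under
# mu | sigma^2 ~ N(0, sigma^2) and sigma^(-2) ~ Exp(1) (assumed priors).
# bf_mean is a hypothetical helper name.
bf_mean <- function(x) {
  n <- length(x)
  xbar <- mean(x)
  s2 <- (n - 1) * var(x)       # sum of squared deviations from the mean
  (n + 1)^(-1 / 2) *
    ((n * xbar^2 + s2 + 2) / (n * xbar^2 / (n + 1) + s2 + 2))^((n + 2) / 2)
}

set.seed(7)
b0 <- bf_mean(rnorm(30))             # data simulated with mu = 0
b1 <- bf_mean(rnorm(30, mean = 1))   # data simulated with mu = 1
print(c(null_data = b0, shifted_data = b1))
```

When the sample mean is exactly zero the bracket equals one and the Bayes factor reduces to $(n+1)^{-1/2}$, the usual penalty for the extra parameter.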


8.3.3  Inference  on  Potts  Models 

If we consider the specific case of the posterior distribution associated with
(8.6) and a uniform prior, Algorithm 8.18 simulates values of $\beta$ uniformly over
(0, 2) and then values x from the Potts model (8.6). Simulating a data set x is
unfortunately non-trivial for Markov random fields and in particular for Potts
models, as we already discussed. While there exist developments towards this
goal in the special case of the Ising model (in the sense that they produce
exact simulations, at a high computing cost), we settle for using a certain
number of steps of an MCMC sampler (for instance, Algorithm 8.17)
updating one clique at a time conditional on the others. Obviously, this solution
brings a further degree of approximation into the picture in that running a
fixed number of iterations of the MCMC sampler does not produce an exact
simulation from (8.6). There is however little we can do about this if we want
to use ABC. (And we can further argue that ABC involves such a significant
departure from the exact posterior that an imperfect MCMC simulation does
not matter so much!)

Fig. 8.7. Dataset normaldata: Boxplot representation of the ABC approximation
to the Bayes factor, whose true value is represented by a horizontal line, based on
$10^5$ proposals, a 1% acceptance rate, and 500 replications

Since, for every new value of $\beta$, the algorithm runs a full MCMC
simulation, we need to discuss the choice of the starting value as well. There are (at
least) three natural solutions:

  start completely at random;

  start from the previously simulated x;

  always start from the observed value $x_0$.

The first one is the closest to the MCMC idea and it produces independent
outcomes. The second solution is less compelling as the continuity it creates
between draws is not statistically meaningful, given that the simulated $\beta$'s
change (independently or not) from one step to the other. The third solution
offers the appealing feature of connecting with the observed value $x_0$, thus
favoring proximity between the simulated and the observed values, but this
feature could confuse the issues in that this proximity may be due to a poor
mixing of the chain rather than to a proper choice for $\beta$. (For instance, in
the extreme case where the MCMC chain does not move from $x_0$, $x = x_0$ does not
mean that the simulated $\beta$ is at all interesting for $\pi(\beta|x_0)$...) The distance
used in step 3 of Algorithm 8.18 is the (natural) absolute difference between
the sufficient statistics $S(\mathbf{x})$ and $S(\mathbf{x}_0)$, with

$$S(\mathbf{x}) = \sum_{j \sim i} \mathbb{I}_{x_j = x_i}\,.$$

For the four-neighbour relation, the statistic can be computed directly without
loops as

  sum(x[-1,]==x[-n,])+sum(x[,-1]==x[,-m])

and the whole R code corresponding to a random start of the Metropolis-
Hastings algorithm is as follows:

  > ncol=4; nrow=10; Nsim=2*10^4; Nmc=10^2
  > suf0=sum(x0[-1,]==x0[-nrow,])+sum(x0[,-1]==x0[,-nrow])
  > outa=dista=rep(0,Nsim)
  > for (tt in 1:Nsim){
  +   beta=runif(1,max=2)
  +   xprop=pottshm(ncol,nit=Nmc,n=nrow,beta=beta)
  +   # ncol is the number of colors: the last column has index nrow
  +   dista[tt]=abs(suf0-(sum(xprop[-1,]==xprop[-nrow,])+
  +     sum(xprop[,-1]==xprop[,-nrow])))
  +   outa[tt]=beta
  + }
  > betas=outa[rank(jitter(dista))<=.01*Nsim]

Note the inequality sign <= and the use of jitter to get exactly 0.01*Nsim
values in the vector betas: since the statistic S takes integer values, the
distances contain many ties, which jitter breaks before they are ranked.
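The loop-free expression for the sufficient statistic can be checked against an explicit double loop. The sketch below does so on a random, deliberately non-square image with four colors, so that a confusion between the number of rows, the number of columns, and the number of colors would show up. The function names suffstat and suffloop are ours.

```r
# vectorised version: compare each row with the next, each column with the next
suffstat <- function(x) {
  n <- nrow(x); m <- ncol(x)
  sum(x[-1, ] == x[-n, ]) + sum(x[, -1] == x[, -m])
}

# explicit double loop over lower and right neighbours, for comparison
suffloop <- function(x) {
  s <- 0
  for (a in 1:nrow(x)) for (b in 1:ncol(x)) {
    if (a < nrow(x)) s <- s + (x[a, b] == x[a + 1, b])
    if (b < ncol(x)) s <- s + (x[a, b] == x[a, b + 1])
  }
  s
}

set.seed(3)
x <- matrix(sample(1:4, 6 * 8, replace = TRUE), 6, 8)
print(c(vectorised = suffstat(x), loop = suffloop(x)))
```

Both versions count each pair of neighboring pixels once; the vectorised one is the only practical choice inside an ABC loop that evaluates the statistic tens of thousands of times.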

When applying the above to the Laichedata dataset, we obtain the
outcome represented in Fig. 8.8. When comparing with Fig. 8.6, we can check that
ABC produces an almost exact representation, even though $\epsilon$ is not equal to
zero. As mentioned above, it would actually be feasible to achieve $\epsilon = 0$ with
a larger number of simulations.

Fig. 8.8. Dataset Laichedata: Histogram of the sample of $\beta$'s produced using an
ABC algorithm with $10^4$ iterations and a 1% quantile on the difference between the
sufficient statistics as its tolerance bound $\epsilon$
8.4 Image Segmentation

In  this  section,  we  still  consider  images  as  statistical  objects,  but  they  are 
now  “noisy”  in  the  sense  that  the  color  or  the  grey  level  of  a  pixel  is  not 
observed  exactly  but  with  some  perturbation  (sometimes  called  blurring  as  in 
satellite  imaging).  The  purpose  of  image  segmentation  is  to  cluster  pixels  into 
homogeneous  classes  without  supervision  or  preliminary  definition  of  those 
classes,  based  only  on  the  spatial  coherence  of  the  structure. 

This underlying structure of the "true" pixels is denoted by x, while the
observed image is denoted by y. Both objects x and y are arrays, with each entry
of x taking a finite number of values and each entry of y taking real values
(for modeling convenience rather than reality constraints). We are thus
interested in the posterior distribution of x given y provided by Bayes' theorem,
$\pi(\mathbf{x}|\mathbf{y}) \propto f(\mathbf{y}|\mathbf{x})\pi(\mathbf{x})$. In this posterior distribution, the likelihood, $f(\mathbf{y}|\mathbf{x})$,
describes the link between the observed image and the underlying
classification; that is, it gives the distribution of the noise, while the prior $\pi(\mathbf{x})$ encodes
beliefs about the (possible or desired) properties of the underlying image.
Although, as in other chapters, we cannot provide the full story of Bayesian
image segmentation, an excellent tutorial on Bayesian image processing based
on a summer school course can be found in Hurn et al. (2003).

As  indicated  above,  a  proper  motivation  for  image  segmentation  is  satellite 
processing  since  images  caught  by  satellites  are  often  blurred,  either  because 
of  inaccuracies  in  the  instruments  or  transmission  or  because  of  clouds  or 
vegetation  cover  between  the  satellite  and  the  area  of  interest. 

The  Menteith  dataset  that  motivates  this  section  is  a  100  x  100  pixel 
satellite  image  of  the  lake  of  Menteith,  as  represented  in  Fig.  8.9.  The  lake 
of  Menteith  is  located  in  Scotland,  near  Stirling,  and  offers  the  peculiarity 
of  being  called  “lake”  rather  than  the  traditional  Scottish  “loch.”  As  shown 
by  the  image,  there  are  several  islands  on  this  lake,  one  of  which  houses  an 
ancient  abbey.  The  purpose  of  analyzing  this  satellite  dataset  is  to  classify  all 
pixels  into  one  of  six  states  in  order  to  detect  some  homogeneous  regions. 

The model being introduced, we turn to the central issue, namely how to
draw inference on the "true" image, x, given an observed noisy image, y. The
prior on x is a Potts model with G categories,

$$\pi(\mathbf{x}|\beta) = \frac{1}{Z(\beta)} \exp\Big(\beta \sum_{j \sim i} \mathbb{I}_{x_j = x_i}\Big)\,,$$

where $Z(\beta)$ is the (intractable, see Sect. 8.3) normalizing constant of the Potts
model. Given x, we assume that the observations in y are independent normal
random variables,

$$f(\mathbf{y}|\mathbf{x}, \sigma^2, \mu_1, \ldots, \mu_G) = \prod_{i \in \mathcal{I}} \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\Big\{ -\frac{1}{2\sigma^2} (y_i - \mu_{x_i})^2 \Big\}\,.$$

This model is not exact in that the $y_i$'s are integer grey levels that vary
between 0 and 255, but it is easier to handle than a parameterized distribution
on {0, ..., 255}. This setting is clearly reminiscent14 of the mixture and hidden
Markov models of Chaps. 6 and 7 in that a Markov structure, the Markov
random field, is only observed through random variables indexed by the states.

Fig. 8.9. Dataset Menteith: Satellite image of the lake of Menteith

In this problem, the parameters $\beta, \sigma^2, \mu_1, \ldots, \mu_G$ are usually considered
to be nuisance parameters, a point of view that justifies the use of uniform
priors like

$$\beta \sim \mathcal{U}([0, 2])\,,$$
$$\mu = (\mu_1, \ldots, \mu_G) \sim \mathcal{U}(\{\mu;\ 0 \le \mu_1 \le \ldots \le \mu_G \le 255\})\,,$$
$$\pi(\sigma^2) \propto \sigma^{-2}\, \mathbb{I}_{]0,\infty[}(\sigma^2)\,,$$

the last prior corresponding to a uniform prior on $\log \sigma$.

The upper bound on $\beta$ has been discussed in the previous section. The
ordering of the $\mu_g$'s is not necessary, strictly speaking, but it avoids the label
switching phenomenon discussed in Sect. 6.5. (The alternative is to use the
same uniform prior on all $\mu_g$'s and then reorder them once the MCMC
simulation is done. While this may avoid slow convergence behaviors in some
cases, this strategy also implies more involved bookkeeping and higher
storage requirements. In the case of large images, it simply cannot be considered.)


14 Besides image segmentation, another typical illustration of such structures is
character recognition where a machine scans handwritten documents, e.g., envelopes,
and must infer a sequence of symbols (i.e., numbers or letters) from digitized
pictures. Hastie et al. (2001) provide an illustration of this problem.




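The corresponding posterior distribution is thus

$$\pi(\mathbf{x}, \beta, \sigma^2, \mu | \mathbf{y}) \propto \pi(\beta, \sigma^2, \mu) \times \frac{1}{Z(\beta)} \exp\Big(\beta \sum_{j \sim i} \mathbb{I}_{x_j = x_i}\Big) \times \prod_{i \in \mathcal{I}} \frac{1}{\sigma} \exp\Big\{ -\frac{1}{2\sigma^2} (y_i - \mu_{x_i})^2 \Big\}\,.$$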

We  can  therefore  construct  the  various  full  conditionals  of  this  joint  distri¬ 
bution  with  a  view  to  the  derivation  of  a  hybrid  Gibbs  sampler  for  this  model. 
First,  the  full  conditional  distribution  of  Xi  (i  E  X)  is  (1  <9<G) 


9 |y,  P,  v2,  M)  cx  exp  <  p  ^  Ix. 


-9 


jr^l 


1 

2^ 


(yi  -  »g) 


1 


which  can  be  simulated  directly,  even  though  this  is  no  longer  a  Potts  model. 
As in the mixture and hidden Markov cases, once $\mathbf{x}$ is known, the groups associated
with each category $g$ separate and therefore the $\mu_g$'s can be simulated
independently conditional on $\mathbf{x}$, $\mathbf{y}$, and $\sigma^2$. More precisely, if we denote by

$$n_g = \sum_{i\in\mathcal{I}}\mathbb{I}_{x_i=g} \quad\text{and}\quad s_g = \sum_{i\in\mathcal{I}}\mathbb{I}_{x_i=g}\,y_i$$


the number of observations and the sum of the observations allocated to category
$g$, respectively, the full conditional distribution of $\mu_g$ is a truncated
normal distribution on $[\mu_{g-1},\mu_{g+1}]$ (setting $\mu_0=0$ and $\mu_{G+1}=255$) with
mean $s_g/n_g$ and variance $\sigma^2/n_g$. (Obviously, if no observation is allocated
to this group, the conditional distribution turns into a uniform distribution
on $[\mu_{g-1},\mu_{g+1}]$.) The full conditional distribution of $\sigma^2$ is an inverse gamma
distribution with parameters $|\mathcal{I}|/2$ and $\sum_{i\in\mathcal{I}}(y_i-\mu_{x_i})^2/2$, where $|\mathcal{I}|$
denotes the number of pixels. Finally, the full conditional distribution of $\beta$ is such that

$$\pi(\beta\mid\mathbf{x},\mathbf{y},\sigma^2,\mu) \propto \frac{1}{Z(\beta)}\exp\Big(\beta\sum_{i\in\mathcal{I}}\sum_{j\sim i}\mathbb{I}_{x_j=x_i}\Big),$$

since $\beta$ does not depend on $\sigma^2$, $\mu$, and $\mathbf{y}$, given $\mathbf{x}$. As discussed in Sect. 8.3.1,
a path sampler can provide an approximation for the ratio of normalizing
constants.
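
The sufficient statistics $n_g$ and $s_g$ that drive the update of the $\mu_g$'s are cheap to compute once the allocation is known. The sketch below does so on a toy 4 x 4 image with three categories; the matrices x and y, the value of sigma2, and the names cond_mean and cond_var are all made up for the illustration.

```r
# Toy 4 x 4 allocation x (categories 1..3) and observations y; both hypothetical
x <- matrix(c(1, 1, 2, 2,
              1, 2, 2, 3,
              1, 2, 3, 3,
              1, 1, 3, 3), 4, 4, byrow = TRUE)
y <- matrix(c(30, 34, 52, 49,
              33, 51, 50, 66,
              35, 48, 64, 65,
              31, 36, 67, 63), 4, 4, byrow = TRUE)
G <- 3
sigma2 <- 100

n_g <- tabulate(x, nbins = G)                              # counts per category
s_g <- as.vector(tapply(y, factor(x, levels = 1:G), sum))  # sums per category
s_g[is.na(s_g)] <- 0                                       # empty categories

cond_mean <- s_g / n_g    # mean of the normal before truncation
cond_var <- sigma2 / n_g  # variance of the normal before truncation
```

An empty category gives n_g equal to 0 and hence an undefined mean, which is the case where the conditional reduces to a uniform distribution on the interval between the neighboring means.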
In the case of the Menteith data, we use a four-neighbour neighbourhood
and $G=6$ on a $100\times100$ image. For $\beta$ ranging from 0 to 2 by steps of 0.1,
the approximation to $f(\beta)$ is based on 1,500 iterations of Algorithm 8.17
(after burn-in), following the same procedure as with Fig. 8.5. The resulting
piecewise-linear function is given in Fig. 8.11 and is smooth enough for us
to consider the approximation as acceptable. (We use these numerical values
in the clustering function reconstruct as the vector dali.) Note that the
increasing nature of the function $f$ in $\beta$ is intuitive: As $\beta$ grows, the probability
of having more neighbors of the same category increases and so does $S(\mathbf{x})$.

Fig. 8.11. Approximation of $f(\beta)$ for the Potts model on a $100\times100$ image, a
four-neighbour neighbourhood, and $G=6$, based on 1,500 MCMC iterations after
burn-in
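
To illustrate how the tabulated values of $f(\beta)$ are turned into a ratio of normalizing constants, the sketch below interpolates them linearly and integrates between two values of $\beta$, on the assumption (consistent with the acceptance ratio in reconstruct) that the integral of $f$ between $\beta_0$ and $\beta_1$ approximates $\log Z(\beta_1)-\log Z(\beta_0)$. The values are those of the vector dali; the helper name log_ratio_Z is hypothetical.

```r
# Values of f(beta) on the grid 0, 0.1, ..., 2 (the vector dali of reconstruct)
dali <- c(6667.729, 7245.159, 7856.514, 8523.00, 9242.127, 10025.211,
          10896.380, 11877.379, 12985.344, 14360.080, 16062.470, 18408.592,
          22755.124, 33163.207, 35947.756, 36745.675, 38286.608, 38534.912,
          38531.211, 38916.662, 38495.781)
grid <- seq(0, 2, length.out = 21)
thefunc <- approxfun(grid, dali)   # piecewise-linear interpolation

# log Z(b1) - log Z(b0), approximated by the integral of f between b0 and b1
log_ratio_Z <- function(b0, b1) integrate(thefunc, b0, b1)$value

lr <- log_ratio_Z(1, 1.05)
```

Because approxfun returns NA outside the grid by default, this helper is only meant for values of $\beta$ between 0 and 2.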


The  corresponding  R  code  for  6  colors  and  4  neighbors  (which  are  the 
specifications  for  the  Menteith  dataset)  is  as  follows: 

reconstruct=function(niter=10^3,y){
  numb=dim(y)[1]                 # the image is numb x numb
  x=0*y                          # current allocation of the pixels
  mu=matrix(0,niter,6)
  sigma2=rep(0,niter)

  # prior input (starting values)
  mu[1,]=c(35,50,65,84,92,120)
  sigma2[1]=100
  beta=rep(1,niter)
  xcum=matrix(0,numb^2,6)        # allocation counts per pixel and color
  n=rep(0,6)

  # approximation of f(beta) for beta=0,0.1,...,2
  dali=c(6667.729,7245.159,7856.514,8523.00,9242.127,10025.211,
    10896.380,11877.379,12985.344,14360.080,16062.470,18408.592,
    22755.124,33163.207,35947.756,36745.675,38286.608,38534.912,
    38531.211,38916.662,38495.781)
  thefunc=approxfun(seq(0,2,length=21),dali)

  for (i in 2:niter){
    lvr=0
    # Gibbs update of every pixel
    for (k in 1:numb){
      for (l in 1:numb){
        for (co in 1:6)
          n[co]=xneig4(x,k,l,co) # neighbors of (k,l) with color co
        x[k,l]=sample(1:6,1,prob=exp(beta[i-1]*n)*
          dnorm(y[k,l],mu[i-1,],sqrt(sigma2[i-1])))
        xcum[(k-1)*numb+l,x[k,l]]=xcum[(k-1)*numb+l,x[k,l]]+1
        lvr=lvr+n[x[k,l]]
      }
    }
    # truncated normal updates of the ordered means
    mu[i,1]=truncnorm(1,mean(y[x==1]),sqrt(sigma2[i-1]/
      sum(x==1)),0,mu[i-1,2])
    for (co in 2:5)
      mu[i,co]=truncnorm(1,mean(y[x==co]),sqrt(sigma2[i-1]/
        sum(x==co)),mu[i,co-1],mu[i-1,co+1])
    mu[i,6]=truncnorm(1,mean(y[x==6]),sqrt(sigma2[i-1]/
      sum(x==6)),mu[i,5],255)
    # inverse gamma update of sigma2
    sese=sum((y-mu[i,1])^2*(x==1))
    for (co in 2:6)
      sese=sese+sum((y-mu[i,co])^2*(x==co))
    sigma2[i]=1/rgamma(1,numb^2/2,sese/2)
    # random walk Metropolis-Hastings update of beta;
    # a proposal outside [0,2] has zero prior probability and is rejected
    betatilde=beta[i-1]+runif(1,-0.05,0.05)
    laccept=-Inf
    if ((betatilde>=0)&&(betatilde<=2))
      laccept=lvr*(betatilde-beta[i-1])+integrate(thefunc,
        betatilde,beta[i-1])$value
    if (log(runif(1))<laccept){
      beta[i]=betatilde}else{beta[i]=beta[i-1]}
  }

  list(beta=beta,mu=mu,sigma2=sigma2,xcum=xcum)
}

In  the  above,  truncnorm  is  the  standard  simulator  of  a  truncated  normal 
variate  based  on  the  inverse  cdf  (see  Robert  and  Casella,  2004,  Chap.  2,  for 
details). 
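
For readers who want to see what an inverse-cdf simulator of this kind looks like, here is a minimal sketch. It is not the truncnorm function used above; the name rtnorm_inv and its argument names are assumptions made for this illustration.

```r
# Simulate n draws from N(mean, sd^2) truncated to [a, b] by inverting the cdf
rtnorm_inv <- function(n, mean, sd, a, b) {
  pa <- pnorm(a, mean, sd)   # cdf at the lower bound
  pb <- pnorm(b, mean, sd)   # cdf at the upper bound
  u <- runif(n, pa, pb)      # uniform draw on the cdf range [pa, pb]
  qnorm(u, mean, sd)         # map back through the quantile function
}

set.seed(42)
z <- rtnorm_inv(1000, mean = 60, sd = 10, a = 50, b = 65)
range(z)   # all draws lie within the truncation interval
```

This direct inversion is fine when the interval is not far in the tails of the normal; when pa and pb are both numerically 0 or 1, a more careful method is needed.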


Figures 8.12-8.14 illustrate the convergence performances of the hybrid
Gibbs sampler for Menteith. In that case, using $h=0.05$ shows that 2,000
MCMC iterations are sufficient for convergence. (Recall, however, that $\mathbf{x}$ is a
$100\times100$ image and thus that a single Gibbs step implies simulating the value
of $10^4$ pixels. This comes in addition to the cost of approximating the ratio of
normalizing constants.) All histograms are smooth and unimodal, even though
the moves on $\beta$ are more difficult than for the other components. (Different
values of $h$ were tested for this dataset and none improved this behavior.) Note
that large images like Menteith often lead to a very concentrated posterior
on $\beta$. (Other starting values for $\beta$ were also tested to check for the stability
of the stationary region.)
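
One simple way to quantify how difficult the moves on $\beta$ are is the empirical acceptance rate of the random walk proposal, which can be read off the stored chain. The sketch below uses a short made-up chain; with actual output it would be applied to the beta component returned by reconstruct. The function name accept_rate is hypothetical.

```r
# Fraction of iterations at which the chain moved, i.e., the proposal was accepted
accept_rate <- function(chain) mean(diff(chain) != 0)

# Made-up chain for illustration: 3 accepted moves out of 7 proposals
beta_chain <- c(1.00, 1.00, 1.03, 1.03, 1.03, 1.01, 1.01, 1.04)
rate <- accept_rate(beta_chain)
```

This relies on a rejected proposal leaving the stored value exactly unchanged, which holds for a continuous proposal such as the uniform random walk used here.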


We recall that the primary purpose of this image analysis is to clean
(de-noise) and to classify into $G$ categories the pixels of the image. Based
on the MCMC output and in particular on the chain $(\mathbf{x}^{(t)})_{1\le t\le T}$ (where $T$
is the number of MCMC iterations), an estimator of $\mathbf{x}$ needs to be derived
through an evaluation of the consequences of wrong allocations. Two common
ways of running this evaluation are either to count the number of (individual)
pixel misclassifications,

$$L_1(\mathbf{x},\hat{\mathbf{x}}) = \sum_{i\in\mathcal{I}}\mathbb{I}_{x_i\ne\hat{x}_i},$$

or to use the global "zero-one" loss function (see Sect. 2.3.1),

$$L_2(\mathbf{x},\hat{\mathbf{x}}) = \mathbb{I}_{\mathbf{x}\ne\hat{\mathbf{x}}},$$

which amounts to saying that only a perfect reconstitution of the image is
acceptable (and thus sounds rather extreme in its requirements). It is then
easy to show that the estimators associated with these loss functions are the
marginal posterior mode (MPM), $\hat{\mathbf{x}}^{MPM}$; that is, the image made of the pixels

$$\hat{x}_i^{MPM} = \arg\max_{1\le g\le G}\mathbb{P}^{\pi}(x_i=g\mid\mathbf{y}),\qquad i\in\mathcal{I},$$


Fig. 8.12. Dataset Menteith: Sequence of $\mu_g$'s based on 2,000 iterations of the
hybrid Gibbs sampler (read row-wise from $\mu_1$ to $\mu_6$)

Fig. 8.13. Dataset Menteith: Histograms of the $\mu_g$'s represented in Fig. 8.12

Fig. 8.14. Dataset Menteith: Raw plots and histograms of the $\sigma^2$'s and $\beta$'s based
on 2,000 iterations of the hybrid Gibbs sampler (the first row corresponds to $\sigma^2$)


and the maximum a posteriori estimator (2.4),

$$\hat{\mathbf{x}}^{MAP} = \arg\max_{\mathbf{x}}\pi(\mathbf{x}\mid\mathbf{y}),$$

respectively. Note that it makes sense that the $\hat{\mathbf{x}}^{MPM}$ estimator only depends
on the marginal distribution of the pixels, given the linearity of the loss function.
Both loss functions are nonetheless associated with image reconstruction
rather than true classification (Exercise 8.14).
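
The difference between the two loss functions is easy to see on a small example. The sketch below codes both losses for images stored as matrices of category labels; the function names loss_L1 and loss_L2 and the toy images are assumptions made for the illustration.

```r
# L1: number of misclassified pixels; L2: 1 unless the two images coincide
loss_L1 <- function(x, xhat) sum(x != xhat)
loss_L2 <- function(x, xhat) as.numeric(any(x != xhat))

# Toy 2 x 3 images differing in a single pixel
x <- matrix(c(1, 2, 2, 3, 3, 1), 2, 3)
xhat <- matrix(c(1, 2, 2, 3, 1, 1), 2, 3)
losses <- c(L1 = loss_L1(x, xhat), L2 = loss_L2(x, xhat))
```

With one wrong pixel out of six, the first loss is small relative to the image size, while the second already takes its maximal value, which is why it sounds extreme in its requirements.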
The estimators $\hat{\mathbf{x}}^{MPM}$ and $\hat{\mathbf{x}}^{MAP}$ obviously have to be approximated since
the marginal posterior distributions $\pi(x_i\mid\mathbf{y})$ ($i\in\mathcal{I}$) and $\pi(\mathbf{x}\mid\mathbf{y})$ are not available
in closed form. The marginal distributions of the $x_i$'s being by-products
of the MCMC simulation of $\mathbf{x}$, we can use, for instance, as an approximation
to $\hat{\mathbf{x}}^{MPM}$ the most frequent occurrence of each pixel $i\in\mathcal{I}$,

$$\hat{x}_i^{MPM} = \arg\max_{g\in\{1,\ldots,G\}}\sum_{j=1}^{N}\mathbb{I}_{x_i^{(j)}=g},$$

based on a simulated sequence, $\mathbf{x}^{(1)},\ldots,\mathbf{x}^{(N)}$, from the posterior distribution
of $\mathbf{x}$. (This is not the most efficient approximation to $\hat{\mathbf{x}}^{MPM}$, obviously, but it
comes as a cheap by-product of the MCMC simulation and it does not require
the use of more advanced simulated annealing tools, mentioned in Sect. 6.7.)
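
Given the allocation counts accumulated during the simulation, this approximation takes one line of R. The sketch below assumes counts stored as in the xcum component of reconstruct, with row (k-1)*numb+l holding the counts of pixel (k,l); the counts themselves and the name xmpm are made up for a 2 x 2 image with three categories.

```r
# Hypothetical allocation counts for a 2 x 2 image and G = 3 categories:
# row (k-1)*numb+l holds the counts of pixel (k,l)
numb <- 2
xcum <- matrix(c(8, 1, 1,    # pixel (1,1)
                 2, 7, 1,    # pixel (1,2)
                 0, 3, 7,    # pixel (2,1)
                 5, 5, 0),   # pixel (2,2): tie, which.max keeps the first
               numb^2, 3, byrow = TRUE)

# Most frequent category per pixel, reshaped row by row into an image
xmpm <- matrix(apply(xcum, 1, which.max), numb, numb, byrow = TRUE)
```

Note that counts accumulated from the first iteration onward include the burn-in period, so one may prefer to accumulate them only after convergence.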

Unfortunately, the same remark cannot be made about $\hat{\mathbf{x}}^{MAP}$: the state
space of the simulated chain $(\mathbf{x}^{(t)})_{1\le t\le T}$ is so huge, being of cardinality
$G^{100\times100}$, that it is completely unrealistic to look for a proper MAP estimate
out of the sequence $(\mathbf{x}^{(t)})_{1\le t\le T}$. Since $\pi(\mathbf{x}\mid\mathbf{y})$ is not available in closed
form, even though this density could be approximated by

$$\hat{\pi}(\mathbf{x}\mid\mathbf{y}) \propto \sum_{t=1}^{T}\pi(\mathbf{x}\mid\mathbf{y},\beta^{(t)},\mu^{(t)},\sigma^{(t)}),$$

thanks to a Rao-Blackwellization argument, it is rather difficult to propose a
foolproof simulated annealing that converges to $\hat{\mathbf{x}}^{MAP}$ (although there exist
cheap approximations; see Exercise 8.15).

The segmented image of Lake Menteith is given by the MPM estimate
that was found after 2,000 iterations of the Gibbs sampler. We reproduce in
Fig. 8.15 the original picture to give an impression of the considerable improvement
brought by the algorithm.


8.5  Exercises 

8.1 Find two conditional distributions $f(x\mid y)$ and $g(y\mid x)$ such that there is no joint
distribution corresponding to both $f$ and $g$. Find a necessary condition for $f$ and $g$ to
be compatible in that respect; i.e., to correspond to a joint distribution on $(x,y)$.

8.2 Using the Hammersley-Clifford theorem, show that the full conditional distributions
given by (8.3) are compatible with a joint distribution. Deduce that the Ising model
is a Markov random field.

8.3 If a joint density $\pi(y_1,\ldots,y_n)$ is such that the conditionals $\pi(y_{-i}\mid y_i)$ never cancel
on the supports of the marginals, show that the support of $\pi$ is equal to the
Cartesian product of the supports of the marginals.

Fig. 8.15. Dataset Menteith: (top) Segmented image based on the MPM estimate
produced after 2,000 iterations of the Gibbs sampler and (bottom) the observed image


8.4 Describe the collection of cliques $\mathcal{C}$ for an eight-neighbour neighbourhood structure
such as in Fig. 8.2 on a regular $n\times m$ array. Compute the number of cliques.

8.5 Draw the function $Z(\beta)$ for a $3\times5$ array. Determine the computational cost of
the derivation of the normalizing constant $Z(\beta)$ of (8.4) for an $m\times n$ array.

8.6 Show that the joint distribution (8.5) is indeed compatible with the full conditionals
of the Potts model. Can you derive this joint distribution from the Hammersley-Clifford
representation (8.1)?

8.7 For an $n\times m$ array $\mathcal{I}$, if the neighbourhood relation is based on the four nearest
neighbors, show that the $x_{i,j}$'s for which $(i+j)\equiv0 \pmod 2$ are independent conditional
on the $x_{i,j}$'s for which $(i+j)\equiv1 \pmod 2$ ($1\le i\le n$, $1\le j\le m$). Deduce that the
update of the whole image can be done in two steps by simulating the pixels with even
sums of indices and then the pixels with odd sums of indices. (This modification of
Algorithm 8.16 is a version of the Swendsen-Wang algorithm.)

8.8 Determine the computational cost of the derivation of the normalizing constant
of the distribution (8.5) for an $n\times m$ array and $G$ different colors.




8.9 Use the Hammersley-Clifford theorem to establish that (8.5) is the joint distribution
associated with the conditionals above. Deduce that the Potts model is an MRF.

8.10 Derive an alternative to Algorithm 8.17 where the probabilities in the multinomial
proposal are proportional to the numbers of neighbors $n_{u_\ell,g}$ and compare its performance
with that of Algorithm 8.17.

8.11 Show that the Swendsen-Wang improvement given in Exercise 8.7 also applies
to the simulation of $\pi(\mathbf{x}\mid\mathbf{y},\sigma^2,\mu)$.

8.12 Using a piecewise-linear interpolation of $f(\beta)$ based on the values
$f(\beta_1),\ldots,f(\beta_M)$, with $0=\beta_1<\ldots<\beta_M=2$, give the explicit value of the
integral

$$\int_{\alpha_0}^{\alpha_1} f(\beta)\,\mathrm{d}\beta$$

for any pair $0\le\alpha_0<\alpha_1\le2$.


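8.13 Show that the estimators $\hat{\mathbf{x}}$ that minimize the posterior expected losses
$\mathbb{E}^{\pi}[L_1(\mathbf{x},\hat{\mathbf{x}})\mid\mathbf{y}]$ and $\mathbb{E}^{\pi}[L_2(\mathbf{x},\hat{\mathbf{x}})\mid\mathbf{y}]$ are $\hat{\mathbf{x}}^{MPM}$ and $\hat{\mathbf{x}}^{MAP}$, respectively.

8.14 Determine the estimators $\hat{\mathbf{x}}$ associated with two loss functions that penalize
differently the classification errors,

$$L_3(\mathbf{x},\hat{\mathbf{x}}) = \sum_{i,j\in\mathcal{I}}\mathbb{I}_{x_i=x_j}\,\mathbb{I}_{\hat{x}_i\ne\hat{x}_j} \quad\text{and}\quad L_4(\mathbf{x},\hat{\mathbf{x}}) = \sum_{i,j\in\mathcal{I}}\mathbb{I}_{x_i\ne x_j}\,\mathbb{I}_{\hat{x}_i=\hat{x}_j}.$$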


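8.15 Since the maximum of $\pi(\mathbf{x}\mid\mathbf{y})$ is the same as that of $\pi(\mathbf{x}\mid\mathbf{y})^{\kappa}$ for every $\kappa\in\mathbb{N}$,
show that

$$\pi(\mathbf{x}\mid\mathbf{y})^{\kappa} = \int\pi(\mathbf{x},\theta_1\mid\mathbf{y})\,\mathrm{d}\theta_1 \times\cdots\times \int\pi(\mathbf{x},\theta_\kappa\mid\mathbf{y})\,\mathrm{d}\theta_\kappa, \qquad (8.9)$$

where $\theta_i=(\beta_i,\mu_i,\sigma_i^2)$ ($1\le i\le\kappa$). Deduce from this representation an optimization
scheme that slowly increases $\kappa$ over iterations and that runs a Gibbs sampler for the
integrand of (8.9) at each iteration.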

8.16 For the Ising model, show that the distribution (8.4) can be also defined as

$$\pi(\mathbf{x}) \propto \exp\Big(2\beta\sum_{j\sim i}\mathbb{I}_{x_j=x_i=1}\Big)$$

when the number of neighbors is constant.

8.17 Show that the joint distribution (8.4) can be obtained from the full conditionals
(8.3) by virtue of the Hammersley-Clifford representation (8.1).

8.18 Show that the Ising distribution is symmetric in that inverting the color of all
pixels does not change the probability (8.4).

8.19 For the Ising model, run a simulation experiment that should locate the limiting
value of $\beta$ above which almost all pixels are of the same color. Same question for the
(negative) limiting value of $\beta$ below which the image is a perfect checkerboard.

8.20 Show that the ABC algorithm implemented with $\epsilon=0$ and a distance between
sufficient statistics is not approximate in that the output is truly simulated from the
posterior distribution $\pi(\theta\mid\mathbf{x})\propto f(\mathbf{x}\mid\theta)\pi(\theta)$.


